**From Samples to Insights into Metabolism: Uncovering Biologically Relevant Information in LC-HRMS Metabolomics Data**

#### **Julijana Ivanisevic 1,\* and Elizabeth J. Want 2,\***


Received: 4 November 2019; Accepted: 12 December 2019; Published: 17 December 2019

**Abstract:** Untargeted metabolomics (including lipidomics) is a holistic approach to biomarker discovery and mechanistic insights into disease onset and progression, and response to intervention. Each step of the analytical and statistical pipeline is crucial for the generation of high-quality, robust data. Metabolite identification remains the bottleneck in these studies; therefore, confidence in the data produced is paramount in order to maximize the biological output. Here, we outline the key steps of the metabolomics workflow and provide details on important parameters and considerations. Studies should be designed carefully to ensure appropriate statistical power and adequate controls. Subsequent sample handling and preparation should avoid the introduction of bias, which can significantly affect downstream data interpretation. It is not possible to cover the entire metabolome with a single platform; therefore, the analytical platform should reflect the biological sample under investigation and the question(s) under consideration. The large, complex datasets produced need to be pre-processed in order to extract meaningful information. Finally, the most time-consuming steps are metabolite identification, as well as metabolic pathway and network analysis. Here we discuss some widely used tools and the pitfalls of each step of the workflow, with the ultimate aim of guiding the reader towards the most efficient pipeline for their metabolomics studies.

**Keywords:** untargeted metabolomics; liquid chromatography–mass spectrometry (LC-MS); metabolism; experimental design; sample preparation; data processing; metabolite identification; univariate and multivariate statistics; metabolic pathway and network analysis

#### **1. Introduction**

Metabolite identification is widely assumed to remain the major challenge in untargeted mass spectrometry (MS)-based metabolomics. Is this indeed true? Or should greater effort go into designing experiments in a smarter, more streamlined way, and into reducing noise and redundancy in untargeted metabolomics datasets? For example, a meta-analysis comparative strategy can be used, where several pairwise comparisons are performed (with the same control group), followed by a second-order or meta-analysis to prioritize the identification of the shared deregulated metabolites [1,2]. Here, we provide tips on how to design metabolomics experiments in an optimal way, considering sample size, confounders, and bias. We discuss important factors in sample preparation and describe how preparation approaches should be tailored to each biofluid or tissue. Methods should be simple, reproducible, and inexpensive, while preparation steps should not be biased for or against specific analytes, in order to maximize metabolome and/or lipidome coverage. We also summarize different liquid chromatography–mass spectrometry (LC-MS) strategies in order
to acquire high-quality MS and MS/MS data (reversed phase (RP) LC and hydrophilic interaction liquid chromatography (HILIC) coupled to full scan high resolution (HR) MS data-dependent and data-independent acquisition (DDA and DIA)) while maximizing metabolome and lipidome coverage, the parameters to pay attention to for data pre-processing and, specifically, feature annotation. Also covered are which criteria to use for data filtering (quality control, chemical and informatic noise removal, etc.), how best to apply statistical analysis, how to facilitate metabolite identification (using computational approaches), and how to translate the results into a biochemically relevant context (metabolite set enrichment analysis (MSEA), overrepresentation analysis (ORA), metabolic network analysis). We emphasize the importance of employing quality control (QC) strategies in metabolomics studies. QC samples, typically a pool of all study samples, can be used both to condition the analytical column and to monitor stability throughout the run. Expanding the polar metabolome and lipidome coverage, removing noise and redundancy, and considering the metabolic capacities of a model organism (i.e., the biochemical reactions that can be performed by the specific organism, species, genus, etc.) are essential for the generation of well-founded hypotheses from untargeted assays. We explain why the mass analyzer used for untargeted assays should offer high mass accuracy and resolution, as well as the ability to perform fragmentation or MS/MS experiments for structural elucidation. Many different software tools exist for the extraction of peaks (metabolite features) from the data, the deconvolution of such data, and the subsequent analysis in both multivariate and univariate ways. Many statistical tools are available which aim to streamline and aid interpretation, and we endeavor to summarize and evaluate some of the most commonly used. Finally, we highlight the lack of quantitative data and the need to validate these data-driven hypotheses using targeted quantification, with a focus on the identified biochemical pathways associated with phenotype. These analyses will allow researchers to move towards more mechanistic insights and, most importantly, enable cross-laboratory and cross-study comparisons for intelligent data re-use.

#### **2. Results**

#### *2.1. Considerations for Experimental Design*

Before starting any metabolomics study, it is important to consider the question(s) being asked. Many metabolomics studies are complex in design and may incorporate several classes, e.g., control subjects versus those receiving low and high doses of a drug (Figure 1), or healthy subjects versus those with a benign condition and those with cancer (possibly at several stages). It is vital that the study is designed to maximize useful information, whilst keeping costs and animal usage to a minimum. Ideally, you are aiming for the smallest number of experiments needed to produce the maximum amount of data and achieve precision, whilst addressing power and effect size, and accounting for confounding factors [3]. However, it is challenging to calculate the appropriate sample size for untargeted metabolomics studies, as metabolite changes are typically unknown and may be numerous. Further, the high dimensionality of the data and the large degree of correlation between the variables (metabolite features) add to the complexity of the issue. Ideally, a pilot study should be conducted in order to gain an understanding of the expected effects; however, these are rarely performed for logistical reasons (sample availability, cost, animal usage, ethics, etc.). Software such as MetaboAnalyst can aid in these calculations if pilot data are available [4]. Recently, Ebbels et al. [5] proposed an approach to circumvent the need for obtaining preliminary data by using a multivariate simulation approach. Also publicly available is MetSizeR, which uses information from both the metabolomics experiment and the data analysis technique to simulate pilot data from a statistical model (where two groups are present). In order to estimate the required sample size, permutation-based techniques are applied to these simulated data [6]. Also important to consider are the nature of the experiment and the type of samples being analyzed. For instance, when using cell models, conditions can be tightly controlled, and thus sample numbers kept to a minimum (e.g., five replicates). Animal studies are also subject to fairly tight control in terms of age, housing, diet, underlying disease, etc. Therefore,
for ethical and practical reasons, sample numbers can also be kept low. However, humans prove to be much more challenging subjects. Except in a small number of situations, factors such as diet, exercise, and medication cannot be controlled, and so a much larger number of subjects is needed in order to be able to determine a "normal" range for metabolite levels, account for inter-individual variation, and be able to detect changes above baseline.
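As a rough illustration of the power considerations discussed above, the sketch below estimates a per-group sample size for a simple two-group comparison, assuming a pilot-derived standardized effect size and a Bonferroni-adjusted significance level. The feature count, effect size, and power are hypothetical values, and this simplification ignores the correlation structure of metabolomics data that tools such as MetSizeR and the multivariate simulation approaches mentioned above explicitly address.

```python
# A minimal, hedged sketch: per-group sample size for a two-group t-test,
# with the alpha level Bonferroni-adjusted for the number of features tested.
from statsmodels.stats.power import TTestIndPower

n_features = 500                      # assumed number of metabolite features to be tested
effect_size = 0.8                     # assumed standardized effect size (Cohen's d) from pilot data
alpha_per_test = 0.05 / n_features    # Bonferroni-adjusted significance level
target_power = 0.8

analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=effect_size,
                                   alpha=alpha_per_test,
                                   power=target_power,
                                   alternative="two-sided")
print(f"~{n_per_group:.0f} subjects per group needed at alpha = {alpha_per_test:.1e}")
```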

**Figure 1.** Common experimental designs. (**A**) Cross-over design involving a large patient cohort. Two drugs are administered sequentially to each patient, with a crucial washout period between each drug to enable the effects of each drug to be elucidated. (**B**) Factorial design, where both the gender of the subject and effect of the drug are being studied. (**C**) Common cross-sectional design in metabolomics studies, comparing controls and two drug dose levels in both genders.

#### 2.1.1. The Importance of Controls

It is extremely important to design an experiment containing the correct controls, in order to be able to associate observed metabolite changes with the condition being investigated [7,8]. The main types of controls to consider including are:


#### 2.1.2. Confounding Factors and Variables

There are several sources of variation in metabolomics studies. Firstly, there is the biological variation in the samples themselves. Factors which can affect the metabolic profile of individuals include diet, age, medication, underlying disease, and environmental factors [9]. These are a particular issue for human subjects, as many are difficult to control, but some will be pertinent to animal models as well. When considering cell models, fluctuations in metabolite levels over time must be considered, e.g., as the cells grow and cell density changes (different cell lines grow at different rates). It is important to measure both intra- and extracellular metabolite levels to ensure that the effects observed are due to the treatment and not natural fluctuations. Also important to consider is the time of sampling, as many metabolites are subject to circadian rhythm in human and animal models, particularly hormones in blood and urine. When considering blood samples, whether the subject is in a fasting or non-fasting state should be considered, as blood glucose, amino acid, and lipid levels will be affected dramatically. There is also the variability introduced through sample collection and handling. A large body of work considering these factors is available in the literature [10,11], a detailed treatment of which is beyond the scope of this review. In summary, blood collection tubes can impact the metabolite profile due, in part, to ion suppression from anticoagulants, e.g., ethylenediaminetetraacetic acid (EDTA) [12,13]. Some serum collection tubes contain polymers such as polyethylene glycol (PEG), which is detrimental to LC-MS analysis, masking the signals from potentially important metabolites. Another consideration if collecting serum samples is the time left to clot, as metabolites such as lactate are known to change as clotting time increases, thus changing the metabolite profile [14]. When collecting urine samples, the type of preservative used, e.g., sodium azide or boric acid, may impact upon the metabolite profile [15]. The storage temperature and number of freeze–thaw cycles that the samples undergo are also important, as metabolites may degrade [16,17]. Lastly, the metabolite extraction approach (e.g., liquid extraction versus solid phase extraction; Section 2.2), the extraction solvents used, and the diluent also hugely impact the metabolite profile. Although some approaches may be favored over others, the choice is still largely subjective and will vary between research groups. The key to reproducible metabolomics data is to maintain consistency between samples as much as possible and keep the number of sample handling steps to a minimum.

#### 2.1.3. Which Experimental Design to Choose?

There are several different experimental design types to consider. Amongst the more common are completely randomized, crossover, and factorial designs [18–21]. Although commonly used due to their fairly simple nature, completely randomized designs are limited by the fact that they study the effect of one primary factor without considering other factors. This approach would not be recommended in a metabolomics study, due to likely confounding factors (see Section 2.1.2), which may have a large impact on the metabolite profile. However, in reality, randomized studies are conducted and the confounders considered at the data analysis steps. A solution to this may be to employ a crossover design, where there could be sequential application of several treatments to the same individual (Figure 1). This means that a subject acts as its own control, thus providing smaller within-individual variation. However, the following factors need to be considered: "carryover effect", "time-related effect", "reversible treatment", and "wash out period". Factorial designs investigate the effect of more than one factor simultaneously, such as gender of the subject and response to a treatment, and so have the potential to increase the information obtained from a single study.

#### *2.2. Sample Preparation Approaches*

Crucial to obtaining high-quality metabolomics data is how the samples are prepared. There are many excellent papers in the literature concerning sample preparation for metabolomics studies [22–29] and individual methodologies are beyond the scope of this review. However, it is important to consider some key factors when designing the sample preparation approaches most appropriate for the biofluid or tissue of interest. These include (a) ease of method, i.e., it should be easily reproduced by different operators within the same laboratory and across laboratories; (b) there should be a minimal number of steps, so that technical/analytical variability is kept to a minimum; and (c) cost—a less expensive method will be favorable, so that it can be scaled up to larger sample numbers, such as in the case of epidemiological studies [30,31]. For untargeted metabolomics, it is desirable to use methods which do not bias for or against specific classes of analytes, so that the broadest possible metabolite coverage can be achieved [30,31]. However, it may be practical to prepare sample extracts for polar and non-polar metabolites separately, such as in the case of tissue samples [31]. In general, urine is a straightforward biofluid to prepare, as, unless collected from subjects with proteinuria (or from rodents), it will largely be free from protein, and so a simple centrifugation and dilution approach can be taken [30]. Ensure that the diluent used is compatible with the mode of chromatography to be subsequently employed. Plasma/serum and tissue samples require protein to be removed, which can be performed through the addition of cold organic solvent, often methanol, acetonitrile, isopropanol, or a butanol and methanol solution (BuMe), in a one-step approach [32–35]. Tissue samples require homogenization prior to protein precipitation, often using a bead beater [31]. For both blood and tissue samples, a biphasic extraction approach, such as the Bligh–Dyer method or a variation (e.g., MTBE:MeOH:H2O), can be used [36]. Alternatively, a two-step approach can be utilized, where sequential extraction of polar and non-polar metabolites is performed [37]. Particular care needs to be taken in the preparation of cell samples, where quenching is a crucial step in order to arrest metabolism [26,28]. It is also important to be aware of the stability of analytes, as some, such as adenosine triphosphate (ATP), will degrade rapidly [26], and it may not be possible to measure these accurately. It is also important to randomize the sample preparation order, particularly in the case of large sample numbers, and to ensure that this preparation order is not the same as the analytical run order (Figure 2), so that systematic bias is minimized.

**Figure 2.** Setting up the data acquisition worklist to facilitate metabolite quantification and identification. Prior to the batch run, the instrument should be conditioned (or "passivated") using the pooled quality control (QC) sample of biological samples. During the conditioning, high-quality MS/MS data can be acquired in a data-dependent acquisition (DDA) mode by taking advantage of iterative injections through the application of PC-driven exclusion (of ions for which the MS/MS data have already been acquired). In this way, the amount of acquired high-quality MS/MS data will be maximized. The batch run can start (and end) with the analysis of a diluted QC series that will serve to remove the features whose response is not linear; however, this removal should be performed carefully by evaluating low abundance features and those with saturation issues. Finally, samples should be run in a randomized fashion (considering the most important confounding factors, such as disease, sex, age, etc., depending on the experiment) with pooled QCs every 4–10 samples (depending on the size of the batch). Extracted blanks can be analyzed after the sample run and used for the removal of background (chemical and informatic) noise. Abbreviations: MS/MS data—fragmentation pattern, HRMS—high-resolution mass spectrometry, DDA—data-dependent acquisition, DIA—data-independent acquisition, AIF—all ion fragmentation (on Agilent or Thermo systems), MSE—all ion fragmentation on Waters systems, SWATH—sequential window acquisition of all theoretical mass spectra or DIA strategy on Sciex systems, SONAR—scanning quadrupole DIA or DIA strategy on Waters systems.
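A minimal sketch of how a worklist following the structure in Figure 2 might be assembled programmatically is shown below; the sample names, the number of conditioning injections, the dilution steps, and the QC spacing are all illustrative placeholders and should be adapted to the instrument and study design.

```python
# A hedged, illustrative sketch of a worklist: conditioning QCs, a diluted QC
# series, randomized study samples with interleaved pooled QCs, and blanks.
import random

samples = [f"sample_{i:03d}" for i in range(1, 61)]   # hypothetical study samples
random.seed(42)                                       # fixed seed for a reproducible run order
random.shuffle(samples)                               # randomize run order (ideally stratified by confounders)

worklist = [f"QC_conditioning_{i}" for i in range(1, 11)]                   # ~10 conditioning injections
worklist += [f"QC_dilution_{d}" for d in ("1to8", "1to4", "1to2", "neat")]  # diluted QC series

qc_every = 8                                          # pooled QC every 8 samples (4-10 recommended)
for i, s in enumerate(samples):
    if i % qc_every == 0:
        worklist.append("QC_pooled")
    worklist.append(s)

worklist += ["QC_pooled", "blank_extraction_1", "blank_extraction_2"]       # blanks after the sample run
print("\n".join(worklist))
```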

#### *2.3. Data Acquisition Strategies to Facilitate Metabolite Quantification and Identification*

The choice of technological platform and analytical strategy for sample analysis will be guided by the objective of the study, the metabolites of interest, and the approach—untargeted or targeted—deemed most appropriate to answer the biological question. While Nuclear Magnetic Resonance (NMR) spectroscopy is endowed with high reproducibility and accuracy for metabolite measurement, MS-based technologies have made the most significant imprint in metabolomics following the introduction of electrospray ionization (ESI), which has considerably enhanced measurement sensitivity and thus promoted "omics scale" metabolite analysis [38,39]. Direct injection analytical strategies, such as flow-injection analysis (FIA), that do not apply any analyte separation have already provided coverage of up to 200–300 metabolites. While this direct ionization strategy can be of particular interest in studies where high throughput is essential, for example, in real-time metabolite profiling [40,41], it suffers from ion suppression, poor reproducibility, matrix effects, etc., allowing for only a small fraction of the polar metabolome to be putatively annotated based on accurate mass. In contrast to the situation for polar metabolites, a large body of evidence has demonstrated the value of direct infusion-based shotgun analysis for lipid identification. The latest strategies applied in shotgun lipidomics take advantage of the selective ionization of different classes of lipids in the ion source (i.e., intra-source separation under different conditions) and continuous direct injection of the sample, allowing for multi-dimensional MS analysis (i.e., multiple acquisitions in full scan and MS/MS scan modes), and thus, the unambiguous identification (including isobaric/isomeric species) and accurate quantification of lipid species (in two steps) [42,43]. Although multi-dimensional mass spectrometry-shotgun lipidomics (MDMS-SL) addresses most of the limitations of classical shotgun lipidomics, it is relatively low-throughput and still suffers from ion suppression, thus limiting the analysis of low-abundance lipid species (unless they are derivatized) [43].

Among different hyphenated techniques, such as LC-MS, GC-MS, and CE-MS, that are complementary in their attempt to resolve chemical diversity, LC-ESI MS allows for the most comprehensive coverage of the polar metabolome and lipidome [44,45]. It allows for the simultaneous measurement of several hundred to thousands of metabolites (including lipids) from only minimal amounts of a biological sample in a single analysis. This coverage capacity is a benefit of LC separation that minimizes ion suppression and maximizes measurement specificity by the separation of isobars and isomers and by providing retention time (RT) identifiers [46]. LC also represents the best compromise given limited MS acquisition (scanning) speeds; by improving the specificity, and thus the signal-to-noise (S/N) ratio, it enhances the quantity and the quality (i.e., purity) of acquired MS/MS data, essential for metabolite identification (in untargeted assays) and quantification (in targeted analysis) [47].

Due to inherent chemical diversity and the large size of the metabolome, there is no universal technique that can be used to assess the entire metabolome, i.e., "one size does not fit all". The choice of LC-MS analytical strategy, including the LC and MS modes of analysis, will depend on the type of metabolites to be measured (polar vs. nonpolar) and limitations with respect to time and sample amount, which will determine how many analysis modes could be combined to expand the metabolome and/or lipidome coverage [37,48].

#### 2.3.1. LC Techniques

The most commonly used LC techniques in metabolomics include Reversed-Phase Liquid Chromatography (RPLC), ion pairing RPLC, and HILIC. Stationary phase (hydrophobic or hydrophilic), mobile phase modifiers (formic acid, acetic acid, ammonium acetate or formate, ammonium fluoride, etc.), elution gradient (from highly aqueous to highly organic and vice versa), and sample diluent will vary depending on the chromatographic mode applied. Recognized for its reproducibility and broad applicability, RPLC is predominantly used in untargeted metabolomic assays. While RPLC can be used for profiling of mid-polar and non-polar metabolites, including complex lipids, the major recent challenge in metabolomics has been the separation of highly hydrophilic central carbon metabolites [49], specifically to understand the metabolic shifts in cellular metabolism under different
conditions. To enhance the poor retention of hydrophilic metabolites by RPLC, ion pairing agents (e.g., alkyl sulfonates or heptafluorobutyric acid in positive mode, and long chain tertiary/quaternary amines such as tributylamine in negative mode) can be added into the mobile phase, where they combine with the analyte (i.e., cations or anions) to form an ion pair that can be efficiently retained by the reversed phase packing [50]. Yet, this strategy is not MS friendly, with the background signal of ion pairing agent causing system contamination and resulting in notable ion suppression and reduced sensitivity, thus demanding a dedicated LC-MS system. Alternative strategies, such as multimode C18 columns that contain cation and anion ligands (e.g., HSS T3 Waters, Milford, MA, US, Scherzo SM-C18 Imtakt USA) and, in particular, HILIC, have been developed and have become increasingly robust and popular for polar compound retention [51,52]. Indeed, stationary phases with derivatized silica, including diol, amine, and amide, have proven their efficiency and robustness in the separation of polar molecules through multiple mechanisms, such as partitioning between the mobile phase and enriched water layer on the stationary phase, hydrogen bonding, dipole–dipole interactions, etc. In addition, the stationary phases with zwitterionic functional groups (with the polymeric support, e.g., ZIC-HILIC and ZIC-pHILIC, ZIC stands for zwitterionic stationary phase) offer excellent performance in the retention of highly polar metabolites (e.g., di- and tri-carboxylic acids, phosphorylated energy currency metabolites) via ion exchange, and wide pH range stability (from 2 to 10) [51,53]. Besides polar metabolite separation, HILIC has also been increasingly used for complex lipid separation by class, according to polar head groups [54,55].

For an untargeted metabolomics experiment, one would ideally maximize data acquisition and metabolome coverage by combining HILIC and RPLC in both positive and negative ionization modes. Analysis using HILIC in acidic conditions in positive ionization mode would allow for the assessment of amino acid and acylcarnitine metabolism [56], while the analysis in basic conditions in negative ionization mode would provide insight into glycolysis, the tricarboxylic acid (TCA) cycle, purine and pyrimidine metabolism, etc. [51]. Analysis using RPLC and non-polar eluents (often a combination of isopropanol (IPA) and acetonitrile) would allow for comprehensive lipid profiling, including glycerolipids (TAGs—triacylglycerols, DAGs—diacylglycerols, and MAGs—monoacylglycerols), cholesterol esters (CEs), sphingolipids (sphingomyelins, ceramides), glycerophospholipids (PCs—phosphatidylcholines, PEs—phosphatidylethanolamines, PSs—phosphatidylserines, PIs—phosphatidylinositols, PGs—phosphatidylglycerols), and free fatty acids [57]. These analyses can be performed following two-phase extraction (e.g., MTBE/MeOH/H2O) or single-step extraction using isopropanol or a butanol and methanol solution (BuMe). When time and sample amount are limited, the researcher should decide based on which metabolite classes are of the utmost relevance to answer the specific biological question.

#### 2.3.2. Mass Spectrometry Acquisition Modes

Following LC separation, MS detection must be performed in optimized conditions to acquire the maximum amount of high-quality MS and MS/MS data for metabolite quantification and identification (Figure 2). Optimal MS acquisition conditions are instrumentation-dependent and comprise ion source and analyzer parameters. For an untargeted experiment, data are usually acquired in full scan mode, where the instrument is set to scan the complete mass range from 50 to 1200 Da. Although increasing mass-resolving power is beneficial for resolving co-eluting isobaric compounds, and one may say that "*the higher the resolution the better*", there may never be enough resolution to separate all the metabolites present in complex biological matrices; in the small molecule "world", many compounds have exactly the same accurate mass [58]. From this point of view, resolution becomes less important compared to instrument scanning speed and sensitivity, which are essential for the acquisition of the maximum amount of high-quality MS/MS data necessary to translate putative hits into metabolite identities [47,53]. During sample analysis, HRMS data acquisition can be followed by sequential acquisition of MS/MS data using data-independent acquisition (DIA; such as all-ion-fragmentation (AIF) in Agilent Q-TOF, MSE in Waters Q-TOF, or SWATH in Sciex TripleTOF, and BASIC DIA in Orbitrap) with a minimal loss of
sensitivity (approximately two times), or MS/MS data can be acquired only on pooled QC samples at the end of the run, in DIA or in data-dependent acquisition mode (DDA with a focus on top "n" ions, Table 1). In data-independent acquisition (DIA), all fragment ions for all precursors are acquired simultaneously, while in data-dependent acquisition (DDA) the ions for MS/MS acquisition are selected in real-time based on threshold intensity [59]. Finally, the filtered metabolite features of interest (i.e., those that vary significantly between two or more analyzed conditions) can be targeted for MS/MS data acquisition in selective or targeted MS/MS mode, a posteriori, following data processing, filtering, and statistical analysis. The pitfall of this strategy is the time lapse (and thus possible sample alterations) between the first batch of analyses in MS mode only, for relative quantification, and the targeted run to acquire MS/MS data on ions of interest for their identification.


**Table 1.** MS/MS data acquisition modes with their advantages and disadvantages.

Although DDA is still the most popular simultaneous MS/MS acquisition mode used, DIA is gaining attention following the development of MS/MS data deconvolution algorithms (to link precursor and product ions) and improved coverage for low abundant precursor ions [60–62]. In general, the quality and the amount of acquired MS/MS data depend on instrument acquisition speed and sensitivity (also related to metabolite ionization efficiency). With regards to instrument scanning speed, in DDA, attention should be paid to the *m*/*z* resolution window (wide vs. medium vs. narrow), the accumulation times, and the number of targeted precursor ions per scan [47]. To avoid the selection of the most highly abundant ions each time, across multiple scans, a preferred list of ions of interest can be defined and contaminant ions placed on the exclusion list. Data can also be acquired in a time-staggered fashion through a set of iterative injections (of pooled QC samples) with PC-driven exclusion (of ions on which the data has already been acquired in previous runs), which will significantly enhance the amount of acquired MS/MS data [63,64].
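The iterative-injection strategy described above can be pictured as maintaining a growing exclusion list across repeated pooled-QC injections. The toy sketch below illustrates only the selection logic, not real instrument control; the survey-scan values, tolerances, and top-N setting are invented for the example.

```python
# A toy sketch of iterative DDA: precursors already fragmented in earlier
# injections are excluded, so each new injection targets fresh ions.
def pick_precursors(survey_scan, exclusion, top_n=10, mz_tol=0.01):
    """Select the top_n most intense precursors not already on the exclusion list."""
    selected = []
    for mz, intensity in sorted(survey_scan, key=lambda x: -x[1]):
        if any(abs(mz - ex) <= mz_tol for ex in exclusion):
            continue
        selected.append(mz)
        if len(selected) == top_n:
            break
    return selected

# survey_scan: (m/z, intensity) pairs observed in each pooled-QC injection (illustrative)
injections = [[(180.0634, 5e6), (146.0604, 3e6), (132.1019, 1e6), (118.0863, 8e5)]] * 3

exclusion = []
for run, survey in enumerate(injections, start=1):
    targets = pick_precursors(survey, exclusion, top_n=2)
    exclusion.extend(targets)             # exclude these ions in subsequent injections
    print(f"Injection {run}: acquire MS/MS on {targets}")
```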

DIA can be applied in an all-ion-fragmentation mode (AIF, MSE), where the first quadrupole (Q1) transmits the full mass range (*m*/*z* 50–1700) of precursor ions into the collision cell, or with sequential windows (SWATH and SONAR on Q-TOFs or BASIC-DIA on Orbitrap), where the Q1 will transmit several increments (20–50 amu) across the mass range of interest sequentially into the collision cell (Table 1) [59]. Here, again, the number and the size of mass windows will depend on the instrument acquisition speed. A major challenge related to DIA is to re-establish the direct link between the precursors and their fragment ions, or to correctly deconvolute the MS/MS spectra. The wider the isolation window for precursor ion selection, the higher the contamination of the MS/MS
spectra, making their deconvolution more difficult. Several algorithms have already been successfully implemented, proving their efficiency in MS/MS data deconvolution (MS-DIAL [62], MetDIA [65], DecoMetDIA [66]), with a major limitation being the comprehensiveness of experimentally acquired spectral databases (e.g., METLIN [67], NIST, MoNA, MassBank [68], mzCloud, GNPS [69], etc.). Because the standard characterization needed to expand these experimentally derived spectral databases is time-consuming, considerable effort has been put towards the development of computational tools for the in silico generation of mass spectra used for MS/MS data matching and metabolite annotation (e.g., iMet [70], LipidBlast [71], MetFrag [72], MetDNA [60], CSI:FingerID coupled to Sirius [73]; see the section on metabolite identification below; Figure 3).

**Figure 3.** Overview of lipidomic data analysis (acquired by DDA) using MS-DIAL, the open-access software designed for simultaneous metabolite quantification and identification. Displayed are the MS/MS matched peaks (each lipid class is differently colored) with the example of phosphatidylcholine annotation using MS/MS matching against LipidBlast.

It should be emphasized here that the coupling of ion mobility (IM) analyzers, as an additional separation technique to conventional LC-MS/MS analysis, can markedly facilitate metabolite identification, and even resolve stereoisomers [74]. The separation of ions according to their size and conformation prior to MS/MS data acquisition will also enhance spectral clarity and fragmentation specificity. Importantly, experimental collision cross-section (CCS) values can be computed (using drift tube ion mobility MS or DTIMS and traveling wave ion mobility MS or TWIMS) with very high reproducibility (relative standard deviation (RSD) < 2%) [75]. FAIMS (high-field asymmetric waveform ion mobility MS) is an atmospheric-pressure IM technology that can also be used as an orthogonal separation approach (also known as DMS or differential mobility spectrometry), although it does not allow for the acquisition of CCS values.

In an untargeted metabolomics experiment, one would ideally acquire as much MS and MS/MS data as possible, simultaneously, or at least within the same analysis batch. This would allow for simultaneous metabolite quantification and identification (via MS/MS matching against spectral libraries and using computational tools like Sirius [73]) in an automated fashion. To reach optimal metabolome coverage and annotation, there is room for improvement on the instrumentation side (i.e., limited acquisition speeds, and sensitivity related to ionization efficiency and ion transmission), as well as a need
to enhance the comprehensiveness of spectral libraries (taking into consideration the exposome), and to improve the computational approaches for annotation of unknown metabolites.

It is worth noting that the tedious metabolite identification process in untargeted experiments often yields the identification of "*known (un)knowns*", as a consequence of the above-specified remaining challenges. This limitation has encouraged the development of high-coverage targeted methods for the quantification of polar metabolites and lipids to bridge the gap between untargeted and targeted approaches. These methods can be strategically derived from DIA methods, such as SWATH, capable of acquiring MS/MS data for all detectable metabolites in a biological sample [76]. As a library of multiple reaction monitoring (MRM) transitions acquired on different instruments, METLIN-MRM can be particularly useful to accelerate the development of broad-scale MRM methods [77].

#### *2.4. Data (Pre)Processing: From Peak Detection to Profile Alignment*

#### 2.4.1. Software for Data Pre-processing

The amount of raw data generated from an untargeted metabolomics study using mass spectrometry is often huge, with large file sizes (possibly up to 1–2 GB per sample) depending on the instrumentation used. Therefore, there is a need for large computational power or the use of computational clusters or clouds for data processing. The data pre-processing pipeline consists of several important steps in order to extract the maximum useful information from the data, whilst eliminating redundancy. The many different software packages available for performing these data pre-processing steps range from MS vendor software to freely available scripts and tools. Some examples of freeware are XCMS [78,79], MZmine2 [80], and MS-DIAL [62]. There is also the XCMS Online platform [81,82], where you can upload your data and the processing will be performed for you, employing parameters set within the software.

#### 2.4.2. Important Steps in Data Pre-Processing

The first, crucial step is peak detection (or extraction). At this stage, the files are uploaded (read) into the software and, using a selected algorithm, the software will search for any peaks in the samples. A peak (or metabolite feature) may be defined as a distinct ion species with a unique *m*/*z* ratio and retention time (RT). It is important to note that one metabolite can be represented by multiple peaks or distinct ion species, namely, isotopes, adducts, in-source fragments, or multiply charged species. This peak detection is normally split into two steps: (1) separation of mass traces and (2) filtering or detection of chromatographic features. The parameter settings at this stage will be important, such as the signal to noise ratio (S/N) and the width of the chromatographic peak, in order to enable the detection of peaks with very low S/N ratios while simultaneously filtering out random noise. These parameters, as well as the maximal *m*/*z* deviation, can be calculated by looking at the raw data files of QC samples across the analytical run or similar (e.g., selected study samples across the run) and specified in the pre-processing parameters. The peak width range should be calculated using the narrowest and widest peaks in the chromatograms, again determined visually from QC samples or similar. Extracted ion chromatograms can be constructed to aid the determination of these parameters. Similarly, the S/N and *m*/*z* deviation should be calculated across the elution profile using high and low intensity peaks to ensure an accurate calculation. Typically, the *m*/*z* deviation is ~5 ppm for Orbitrap data and ~25 ppm for Q-ToF data. Once these peaks have been extracted, they need to be grouped, or matched, across all the samples in the dataset. This is to enable peak areas (or, in some cases, peak heights) to be compared across the samples in a semi-quantitative fashion. Untargeted metabolomics experiments can be large, particularly in the case of epidemiological studies where thousands of samples may be analyzed in a single run or across batches. Usually, retention time alignment is needed, as there may be peak shifting across the analytical run (due to changes in pH or temperature, column aging, or build-up on the column). However, such shifting has become less frequent since the advent of U(H)PLC, and the authors have found that in the case of small datasets, retention time correction may no longer be required. Nonetheless, it is important to assess each dataset individually, and as most software performs this retention time
alignment, it is generally advisable to do so. The output at this stage will be a peak table containing *m*/*z*, RT, and abundance for each metabolite feature (peak) in every sample [83–85]. Depending on the software employed, grouping of isotopes/adducts, etc., may have been performed—if it has not, then software such as CAMERA, AStream, RAMClust, and the recently developed METLIN In-source fragment Annotation (MISA) [86,87] exist within the R environment to assist with this grouping and, therefore, data reduction [88–90]. Further, peak annotation may have been performed in some instances through linking with databases, such as with XCMSOnline. This peak table can then be further analyzed, either within the same software or using dedicated software such as SIMCA (Umetrics). Freeware available includes Metaboanalyst [91], a multipurpose software which can also provide pathway analysis tools.
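To make the grouping step more concrete, the simplified sketch below matches peaks from different samples into features using a ppm m/z tolerance and an RT tolerance, producing a small feature-by-sample table. The peak values and tolerances are illustrative, and dedicated tools (XCMS, MZmine2, MS-DIAL) implement far more robust density- and alignment-based algorithms.

```python
# A hedged, simplified sketch of grouping per-sample peaks into features.
import pandas as pd

# peaks: one row per detected peak per sample (illustrative values)
peaks = pd.DataFrame({
    "sample": ["s1", "s2", "s1", "s2"],
    "mz":     [180.0634, 180.0641, 264.1220, 264.1229],
    "rt_min": [2.31, 2.34, 7.85, 7.90],
    "area":   [1.2e6, 1.1e6, 4.5e5, 5.0e5],
})

ppm_tol, rt_tol = 10, 0.2     # assumed tolerances (instrument-dependent)
features = []                 # list of dicts: representative mz/rt plus per-sample areas

for _, p in peaks.sort_values("mz").iterrows():
    for f in features:
        if (abs(p.mz - f["mz"]) / f["mz"] * 1e6 <= ppm_tol
                and abs(p.rt_min - f["rt"]) <= rt_tol):
            f[p["sample"]] = p.area          # same feature seen in another sample
            break
    else:
        features.append({"mz": p.mz, "rt": p.rt_min, p["sample"]: p.area})

print(pd.DataFrame(features))   # rows = features, columns = mz, rt, per-sample areas
```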

#### 2.4.3. Dealing with Artefacts

The output from the data pre-processing software can be very large and complex, depending on the peak picking parameters, as described in the previous section. As instrument sensitivity increases, so does the likelihood of picking up noise and artefacts in the data. Artefacts can include solvent clusters, contaminants (from the column, vials, or solvents), and other spurious signals. These inflate the data and so need to be removed; thus, there are several approaches to tackling this challenge. A widely used approach in the metabolomics community since 2006 is the employment of QC samples [92]. These generally take the form of a pooled sample comprised of aliquots of all study samples, but may be a subset of samples if the size of the study is large [93]. Occasionally, a "surrogate" QC sample could be used, such as the NIST reference plasma material [94].

#### 2.4.4. The Importance of Quality Control

QC samples play a crucial role in untargeted metabolomics studies, in terms of monitoring system stability and data quality (summarized in Table 2). The QC sample will be injected at the start of the analytical batch in order to condition the column and assess instrument stability; the number of injections required may be sample- and column-dependent, but is often in the region of 10 injections [95,96]. Then, the same QC sample can be injected every 4–10 samples, making up to ~10% of the sample injections. This within-run QC can be used to assess stability within the run, e.g., retention time and signal intensity drifts. Importantly, a QC dilution series can be employed; this takes the form of serial dilutions from the QC sample [93]. The purpose of this dilution series is to identify and remove peaks (metabolite features) that do not respond to dilution in a linear manner (as determined by calculating coefficient of determination (r<sup>2</sup> or R2) values), as they are likely to be noise, or at least non-biological in origin. Additionally, the coefficient of variation (CV) can be calculated for every metabolite feature in the within-run QC samples. Features with a CV above a certain threshold (e.g., >30%) can be removed from the dataset, as they are unlikely to be reliable biomarkers [30]. In some cases, metabolite features which appear in fewer than a certain proportion of the QC samples (e.g., in <75% of samples) could also be removed from the data. Lastly, the analysis of blank samples, such as blank mobile phases and also extraction blanks (where the sample preparation procedure has been followed but in the absence of biological sample), can provide valuable insight into the origin of many of the metabolite features reported. Those that appear in the blank samples are again likely to be non-biological in origin and so can be removed from further processing steps [93]. These data filtering and reduction steps can dramatically reduce the size of the dataset and streamline the subsequent data analysis procedure.


**Table 2.** Criteria for feature filtering using QC and blank samples in order to reduce data complexity and remove redundancy.

\* Some groups recommend a lower cut-off, e.g., 20% [97]; \*\* this removal should be performed carefully by evaluating the features whose response may not be linear due to their low abundance.
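The filtering criteria above can be expressed compactly in code. The sketch below assumes a feature-by-injection table of peak areas with known QC, blank, and study-sample columns, and applies the CV, QC-presence, and blank-ratio rules with commonly used (but dataset-dependent) thresholds; the dilution-series linearity check is omitted for brevity, and the data are simulated.

```python
# A hedged sketch of QC- and blank-based feature filtering (Table 2 criteria).
import numpy as np
import pandas as pd

def qc_filter(table, qc_cols, blank_cols, sample_cols,
              cv_max=0.30, qc_presence_min=0.75, blank_ratio_max=0.5):
    qc = table[qc_cols]
    cv = qc.std(axis=1, ddof=1) / qc.mean(axis=1)            # CV across pooled-QC injections
    presence = (qc > 0).sum(axis=1) / len(qc_cols)           # fraction of QCs in which the feature appears
    blank_ratio = table[blank_cols].mean(axis=1) / table[sample_cols].mean(axis=1)
    keep = (cv <= cv_max) & (presence >= qc_presence_min) & (blank_ratio <= blank_ratio_max)
    return table[keep]

# Illustrative table: rows = features, columns = injections
rng = np.random.default_rng(3)
areas = rng.lognormal(mean=12, sigma=0.1, size=(100, 8))     # QC and study-sample areas
blanks = rng.lognormal(mean=8, sigma=0.1, size=(100, 2))     # low background in extraction blanks
data = pd.DataFrame(np.hstack([areas, blanks]),
                    columns=[f"QC{i}" for i in range(4)] + [f"S{i}" for i in range(4)] + ["B1", "B2"])

filtered = qc_filter(data, qc_cols=[f"QC{i}" for i in range(4)],
                     blank_cols=["B1", "B2"], sample_cols=[f"S{i}" for i in range(4)])
print(f"{len(filtered)} of {len(data)} features retained")
```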

#### *2.5. Univariate and Multivariate Statistical Data Analysis*

Untargeted metabolomics studies generate a wealth of data, from which meaningful biological interpretations are desired. Statistical analysis of the data is another hugely important step in the metabolomics pipeline; therefore, there are many important parameters which must be considered. The most typical workflow is to perform multivariate analysis followed by univariate analysis in order to elucidate and validate potentially discriminatory metabolites [98,99].

#### 2.5.1. Multivariate Approaches

Multivariate analysis encompasses methods to reduce the complexity of data, such as that generated from a metabolomics study, where the number of variables (in this case, metabolite features) is greater than the number of samples. Multivariate analysis can be performed using vendor software, programming platforms such as R and Metaboanalyst, or commercial software such as MATLAB® (MathWorks) or SIMCA (Umetrics).

#### 2.5.2. Principal Components Analysis

The first step is generally an unsupervised approach, such as principal components analysis (PCA), which can be used to visualize data structure, class differences, and outliers (Figure 4). PCA can be considered as finding the directions of *maximal variation* in the data. Importantly, with unsupervised approaches, no class information is given, and so an unbiased view of class separation can be obtained. When visualizing a PCA scores plot, the first principal component (PC1) explains the largest variation in the data, followed by PC2, PC3, etc. Multiple classes can be viewed on the scores plot, in two or three dimensions, and so group separation can be observed, e.g., over time. The loadings plot provides an indication of which metabolite features are responsible for any observed separation, e.g., between classes, and can be mapped onto the scores plot if desired, in what is known as a bi-plot.

**Figure 4.** Simplified overview of PCA and OPLS-DA showing (**A**) good separation on PCA and OPLS-DA scores plots. High R2 and Q2 values indicate good model robustness and predictive capability. Permutation test indicates a valid model. (**B**) No separation on the PCA scores plot of PC1 vs. PC2, but separation is still achieved using OPLS-DA. In this instance, the model could be overfitted and unreliable. It is advisable to check for separation in other components, e.g., PC2 vs. PC3, as well as to assess R2 and Q2 and perform permutation tests. CV-ANOVA can also be used to assess model validity (not shown).
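A minimal sketch of the unsupervised step described in Section 2.5.2 is shown below, using log-transformed, autoscaled data and inspecting scores, loadings, and explained variance; the data dimensions, the simulated group effect, and the class labels are purely illustrative.

```python
# A hedged sketch of PCA on a simulated feature table (20 samples x 500 features).
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.lognormal(mean=10, sigma=0.3, size=(20, 500))
X[10:, :25] *= 1.3                                      # simulate modest up-regulation in the treated group
groups = np.array(["control"] * 10 + ["treated"] * 10)

X_scaled = StandardScaler().fit_transform(np.log(X))    # log-transform + autoscaling
pca = PCA(n_components=3)
scores = pca.fit_transform(X_scaled)                    # sample coordinates (scores plot)
loadings = pca.components_.T                            # feature contributions (loadings plot)

print("explained variance ratio:", pca.explained_variance_ratio_.round(3))
for g in np.unique(groups):
    print(f"mean PC1 score, {g}: {scores[groups == g, 0].mean():.2f}")
```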

#### 2.5.3. Supervised Approaches

Once separation has been assessed, supervised analyses can be performed, such as partial least squares discriminant analysis (PLS-DA) and its orthogonal counterpart, OPLS-DA. These approaches incorporate class information and so find a way to achieve the maximal separation between the classes of interest. In the scores plots, the x-axis shows the variation *between* the groups, while the y-axis shows variation *within* the groups. These methods can suffer from the risk of over-fitting the data—they can produce class separation even with random data—and must be interpreted with caution (Figure 4) [100]. This can have detrimental downstream impacts on biomarker discovery and validation, as results may not be reliable or reproducible. R2 and Q2 values can be used to assess the model robustness and predictive power; in an overfitted model, the Q2 in particular will be low. A low Q2 indicates that new data would not be predicted accurately by the model. Further, cross-validation-based model validation approaches, such as CV-ANOVA (based on ANOVA of the cross-validated residuals), can assess model validity [101]. Permutation tests can also be used to assess the significance of a classification. The class assignment is permuted repeatedly, with a model between the data and the permuted class-assignment built for each permutation. These models are then compared with the original multivariate model [102]. Variable Importance in Projection (VIP) scores can be used to identify the metabolite features contributing most to any class separation; VIP scores > 1 are suggested to be important, whilst those < 1 are suggested to be unimportant for the model. The range of VIP scores will vary with each dataset and, in some studies, there may be hundreds of metabolite features with a VIP score around 1, meaning that the cut-off applied may need to be much higher. OPLS-DA S-plots can also be used to identify discriminatory metabolite features warranting further investigation.
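The sketch below illustrates, under stated assumptions rather than as a validated implementation, how a two-class PLS-DA can be emulated with a PLS regression on dummy-coded class labels, how VIP scores can be computed from the weights, scores, and y-loadings, and how a simple permutation test compares the fit obtained with the true labels against shuffled labels. The data are simulated, the component number is arbitrary, and the test uses the training R2 for brevity where a cross-validated Q2 would normally be compared.

```python
# A hedged sketch of PLS-DA with VIP scores and a label-permutation test.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def vip_scores(pls, X):
    """VIP_j = sqrt(p * sum_a SS_a * (w_ja/||w_a||)^2 / sum_a SS_a)."""
    t = pls.transform(X)                       # scores T (n x A)
    w, q = pls.x_weights_, pls.y_loadings_     # weights W (p x A), y-loadings Q (1 x A)
    p = w.shape[0]
    ss = (q.ravel() ** 2) * (t ** 2).sum(axis=0)     # y-variance explained per component
    w_norm = w / np.linalg.norm(w, axis=0)
    return np.sqrt(p * (w_norm ** 2 @ ss) / ss.sum())

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 200))                 # illustrative 20 samples x 200 features
y = np.array([0] * 10 + [1] * 10, dtype=float)
X[y == 1, :5] += 1.0                           # a few genuinely discriminatory features

pls = PLSRegression(n_components=2).fit(X, y)
r2_true = pls.score(X, y)
print("features with VIP > 1:", int((vip_scores(pls, X) > 1).sum()))

# Permutation test: with few samples and many features, shuffled labels may fit
# nearly as well, illustrating the overfitting risk discussed in the text.
n_perm = 200
r2_perm = []
for _ in range(n_perm):
    y_perm = rng.permutation(y)
    r2_perm.append(PLSRegression(n_components=2).fit(X, y_perm).score(X, y_perm))
p_value = (np.sum(np.array(r2_perm) >= r2_true) + 1) / (n_perm + 1)
print(f"R2 = {r2_true:.2f}, permutation p = {p_value:.3f}")
```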

#### 2.5.4. Univariate Methods

Even though multivariate analysis tools can be useful for exploring metabolomics data and guiding researchers towards potential discriminatory biomarkers, there are several pitfalls to these approaches. As discussed above, supervised models suffer from the risk of overfitting. Datasets containing a large amount of sparse data (in terms of the number of input variables) or missing data (which can occur with some pre-processing tools) may compromise model performance [103]. To this end, features which have been proposed as discriminatory from multivariate analyses can be further validated using appropriate univariate statistics [81]. However, univariate tools are also not without their challenges, and it is easy to inadvertently apply the wrong statistical test to a dataset. It is important to assess the data at the start to ensure the correct test is being performed, e.g., whether to use a parametric or non-parametric test. A rule of thumb is that if the data are normally distributed, then a parametric test, such as a *t*-test, can be used. Normality can be tested using, e.g., the Shapiro–Wilk test, which is good when the sample size is < 50. Note that it is not possible to reliably assess normality of the distribution if the sample size is very small, and some tests do not cope well with small sample sizes. Parametric tests are considered to be more powerful than non-parametric tests, with less risk of a false negative (i.e., non-significant) result than with a non-parametric test. However, when dealing with populations that are non-normally distributed, with unequal variances and/or unequal small sample sizes—all possible in untargeted metabolomics—often a non-parametric test can perform better [81]. An additional complication is that univariate tests applied separately to numerous variables will overlook correlations within metabolite features, which may be important in elucidating related metabolites and interpreting biological pathways.
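As a simple illustration of the parametric versus non-parametric choice for a single metabolite feature, the sketch below applies the Shapiro–Wilk test to each group and falls back to a Mann–Whitney U test when normality is rejected. The group sizes, values, and 0.05 cut-off are illustrative, and with very small groups this decision rule is itself unreliable.

```python
# A hedged sketch of per-feature test selection based on a normality check.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
group_a = rng.normal(loc=100, scale=10, size=12)    # e.g., control intensities for one feature
group_b = rng.normal(loc=120, scale=10, size=12)    # e.g., treated intensities for one feature

normal = (stats.shapiro(group_a).pvalue > 0.05) and (stats.shapiro(group_b).pvalue > 0.05)
if normal:
    stat, p = stats.ttest_ind(group_a, group_b)     # parametric
    test = "t-test"
else:
    stat, p = stats.mannwhitneyu(group_a, group_b)  # non-parametric
    test = "Mann-Whitney U"
print(f"{test}: p = {p:.3g}")
```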

#### 2.5.5. Multiple Comparison Testing

In untargeted metabolomics studies, it is likely that the number of metabolite features (variables) is greater than the number of samples analyzed [104,105]. If univariate tests are performed on each of these variables, the chance of findings of significance arising by chance alone is high. These are known as Type I errors (false positives) and must be addressed if valid metabolite markers and meaningful biological conclusions are to be found. To combat this issue, multiple comparison correction can be performed. Commonly used approaches are the Bonferroni correction (a conservative method) and the less conservative Benjamini–Hochberg or Benjamini–Yekutieli false discovery rate (FDR) corrections. These will adjust the *p*-value cut-off, meaning that fewer variables will reach significance and, therefore, there will be fewer false positive results. Using a combination of multivariate and univariate testing, a potential biomarker should have a VIP > 1 and a *p*-value < 0.05 (or the corrected value after FDR correction) [106].
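A minimal sketch of applying the Benjamini–Hochberg correction to a vector of per-feature p-values (such as those produced by the tests above) is shown below; the p-values themselves are invented for illustration.

```python
# A hedged sketch of Benjamini-Hochberg FDR correction across features.
import numpy as np
from statsmodels.stats.multitest import multipletests

p_values = np.array([0.0004, 0.003, 0.012, 0.04, 0.21, 0.67])   # illustrative per-feature p-values
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")

for p_raw, p_adj, sig in zip(p_values, p_adjusted, reject):
    print(f"raw p = {p_raw:.4f}  BH-adjusted p = {p_adj:.4f}  significant: {sig}")
```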

#### *2.6. Metabolite Identification: From Spectral Database Matching to Computational Approaches for Unknown Metabolite Annotation*

Following feature filtering using QC-based estimates (see Table 2) and statistical criteria to extract the metabolite features of interest, the next challenge is to assign identities to these features and to place them in a biochemically relevant context for data interpretation. As specified in Section 2.3, LC-MS is not only the most versatile and comprehensive methodology with respect to metabolome and lipidome coverage, but also provides important information for metabolite structure elucidation, including RT, accurate mass, isotope distribution, and MS/MS fragmentation pattern, in addition to IMS (and/or CCS value). Despite this, the majority of metabolite features in untargeted metabolomic datasets (approximately 80%, so-called "dark matter") remain un-annotated or misidentified [75,107,108], hiding many unknown metabolites, but also high levels of chemical and informatic noise (artefacts of peak detection algorithms) and redundancy (due to defects in feature annotation and grouping algorithms). We distinguish two main bottlenecks, one associated with known metabolite misidentification and the other related to unknown or novel metabolite identification (see Table 3).


**Table 3.** Major problems and solutions associated with metabolite identification in metabolomic datasets. The references for different tools are cited in the main text.

Metabolite identification starts, in general, with database searching using accurate mass (*m*/*z*) measurements (up to 4 decimal places) and prediction of elemental composition (i.e., molecular formula). Accurate mass searches yield many putative hits, including potentially false matches due to the presence of isomers, interferences between metabolites of similar molecular weight (i.e., isobars), and mis-annotation of in-source fragments and even certain adducts (see Table 3) [26]. In most cases, the MS/MS fragmentation pattern, defined by the product ion masses and their relative abundances, will provide sufficiently specific data to confirm the metabolite identity with a high level of confidence. The exceptions are structural and/or stereoisomers (e.g., L- and D-serine, or complex lipids differing only in the positions of unsaturation), which can be distinguished only with additional chromatographic resolution (RT, chiral columns) and/or IMS (and CCS values) data.
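A minimal sketch of this first, accurate-mass step is given below: an observed m/z, assumed here to be a protonated ion ([M+H]+), is matched against a tiny illustrative list of neutral monoisotopic masses within a ppm tolerance. Real searches query large databases (HMDB, METLIN, KEGG), consider many adducts and charge states, and, as noted above, cannot by themselves distinguish isomers.

```python
# A hedged sketch of accurate-mass matching against a small candidate list.
PROTON = 1.007276                      # proton mass (Da)

candidate_masses = {                   # neutral monoisotopic masses (Da), illustrative subset
    "glucose": 180.06339,
    "citrate": 192.02700,
    "glutamine": 146.06914,
    "phenylalanine": 165.07898,
}

def match_mz(observed_mz, tol_ppm=5.0):
    hits = []
    for name, neutral_mass in candidate_masses.items():
        expected_mz = neutral_mass + PROTON            # [M+H]+ assumption
        ppm_error = (observed_mz - expected_mz) / expected_mz * 1e6
        if abs(ppm_error) <= tol_ppm:
            hits.append((name, round(ppm_error, 2)))
    return hits

print(match_mz(181.0707))   # glucose [M+H]+ is ~181.0707, so this returns a glucose hit
```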

MS/MS spectra acquired from samples will be matched against spectral databases containing experimentally acquired spectra on pure standards (e.g., METLIN [67,109], NIST, mzCloud) or any annotated structures (community databases such as MassBank [68], including European MassBank, MassBank of North America, and GNPS based on crowd sourcing [69]). The content of these databases has been extensively reviewed in several recent publications [75,110]. MS/MS spectra matching is usually followed by the similarity score calculations for matches (e.g., METLIN online database) and ranking of candidates based on the similarity to the reference spectra [47,111]. While five different levels of reporting confidence in metabolite identification have been established by the Metabolomics Standards Initiative [75], absolute identity can only be made when an authentic commercially available standard has been compared to the analyte of interest and found to match all applicable measurements (accurate *m*/*z*, MS/MS, RT, etc.). When standards are not available, the unknown metabolite of interest needs to be isolated from the biological matrix (e.g., plant, fungi, sponge extract) using LC, and the combined LC-HRMS and NMR analysis will allow for structural elucidation. The novel metabolite identity needs to be confirmed by custom synthesis of standard and its analysis under the same analytical conditions.
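A toy sketch of the spectral matching idea follows: fragment spectra are binned and compared with a cosine similarity score. The spectra, bin width, and mass range are illustrative, and production matching engines use more sophisticated, m/z-weighted or modified cosine scores together with candidate ranking.

```python
# A hedged sketch of MS/MS spectral similarity scoring via binned cosine similarity.
import numpy as np

def binned_vector(spectrum, bin_width=0.01, max_mz=500.0):
    """spectrum: list of (fragment m/z, relative intensity) pairs."""
    vec = np.zeros(int(max_mz / bin_width) + 1)
    for mz, intensity in spectrum:
        vec[int(round(mz / bin_width))] += intensity
    return vec

def cosine_score(query, library):
    a, b = binned_vector(query), binned_vector(library)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

query   = [(85.03, 40.0), (115.04, 100.0), (157.05, 25.0)]           # illustrative query spectrum
library = [(85.03, 35.0), (115.04, 100.0), (157.05, 30.0), (201.06, 5.0)]  # illustrative library spectrum
print(f"cosine similarity = {cosine_score(query, library):.3f}")
```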

To facilitate and automate metabolite identification, significant efforts have been made to further expand the experimentally derived spectral libraries by MS/MS data acquisition (on different instruments, collision energies, and ionization modes) and sharing. However, compared to the size and diversity of the endogenous and exogenous metabolome, this conventional method of metabolite annotation by matching the experimentally acquired MS/MS spectra to standard spectral databases remains limited by the size of the databases and the lack of commercially available standards for many cellular metabolites. To address this problem, the computational metabolomics community has recently grown, developing and improving computational approaches for known and unknown metabolite identification (Table 3). These computational metabolomic approaches employ two main strategies: (1) in silico prediction of fragmentation MS/MS spectra from the chemical structures of known compounds, and (2) in silico prediction of molecular substructures (i.e., molecular fingerprints or feature vectors that encode the structure of a molecule) and general chemical properties of the unknowns from experimentally acquired MS/MS spectra [112]. With the in silico fragmentation methods, the experimentally acquired spectra of an unknown metabolite (for which reference spectra are not available) can be matched against theoretically predicted spectra simulated in silico from known candidate structures retrieved from databases (Human Metabolome Database (HMDB), PubChem, KEGG, etc.) [113]. In silico fragmentation from chemical structures of known compounds can be computed by rule- (e.g., MS-FINDER, LipidBlast), combinatorial- (e.g., MetFrag), and machine learning-based methods (e.g., CFM-ID) [75]. Rule-based generation of specific fragmentation patterns and heuristic modeling of ion abundances is efficient for classes that have consistent and predictable fragmentation patterns, such as lipids (e.g., LipidBlast).

The in silico prediction of molecular substructures relies on machine learning-based methods that can translate MS/MS spectra into metabolite structure information. To learn the mapping of an MS/MS spectrum to a molecular structure, these methods need to be trained on spectral databases of known metabolites. In general, machine learning methods can be divided into two groups: supervised learning for substructure prediction (e.g., CSI:FingerID) and unsupervised learning for substructure annotation and grouping of metabolites based on shared, biochemically relevant substructures (e.g., MS2LDA) [112,114–116]. The main objective of supervised methods, such as CSI:FingerID integrated in the Sirius tool, is to determine, using a database of molecular structures, the structure that best fits the experimental data. In Sirius 4, the assessment of molecular structures from MS/MS data can be performed automatically for the entire LC-MS dataset (rather than per spectrum) and MS data-driven annotations can be obtained for all detected features [73]. These machine learning approaches were essential for the recent progress in metabolite identification and will pave the way for future metabolite structural identification.

Data sharing will also be key to advancing these computational approaches. There are two main repositories that can be used for metabolomics data sharing, the Metabolomics Workbench (US, [117]) and MetaboLights (EU, [118]). There is room for improvement in the data upload process, which demands painstaking data preparation due to the considerable requirements for sample- and method-related metadata.

#### *2.7. Metabolite Features and/or Metabolites to Pathways and Metabolic Networks*

#### 2.7.1. Metabolic Networking for Metabolite Identification

While pathway and network analysis are mainly used to facilitate metabolite data visualization and interpretation, the biochemical knowledge about chemical reactions (i.e., metabolite conversions via enzymes) and metabolic pathways integrated within a metabolic network (to sustain cellular function) can also be used to facilitate metabolite identification. As an alternative to the above-described tools relying only on the spectral data and information related to molecular (chemical) structure, several approaches, such as Mummichog [119], PIUMet [120], and MetDNA [60], based on the "*features to pathways*" principle, have been developed to facilitate and speed up metabolite identification using reference metabolic network models. This biochemically relevant information can indicate which metabolites the organism of interest is able to produce and thus increase the confidence of metabolite annotations (see Table 3) [121]. Both Mummichog and PIUMet rely on the assumption that locally enriched metabolite matches within the metabolic network are true, while false matches will distribute randomly. Both tools will infer metabolically active pathways without requiring metabolite identification. Finally, metabolite identities will be predicted, and chemical information on annotated isotopes and adducts will be used to evaluate the prediction confidence level. Metabolite annotation and Dysregulated Network Analysis, or MetDNA, uses the metabolic network knowledge for the annotation of known metabolites (from highly conserved primary metabolism) detected in untargeted experiments. Annotation starts from a set of identified "seed" metabolites by predicting their reaction-paired neighbor metabolites on the assumption of their structural similarities. Through the reiterated application of this recursive algorithm, the annotations are progressively propagated and the number of annotated metabolites significantly enhanced (up to 2000 metabolites from one untargeted experiment) [60]. Using a similar principle, GNPS, or Global Natural Products Social Molecular Networking, constructs a molecular similarity network based on the similarity of MS/MS spectra (two metabolites share similar MS/MS data due to their structural similarity) with the aim of annotating unknown natural products using already annotated metabolites (by the community) within the same sub-network. While these networking approaches are fast and valuable for the reduction of metabolomic datasets, annotation remains ambiguous and should be validated through more specific targeted MS/MS analysis.
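To illustrate the propagation principle, the toy sketch below starts from a single confidently identified seed, proposes reaction-paired neighbours from a minimal hand-made network as candidates for unannotated features whose [M+H]+ m/z fits the expected neighbour mass, and repeats until nothing new is annotated. The network, masses, and tolerance are invented, and real tools such as MetDNA additionally exploit MS/MS similarity and handle the isomer ambiguity that this example deliberately glosses over (F2 could equally be fructose-6-phosphate).

```python
# A toy, hedged sketch of seed-based annotation propagation over a reaction-pair network.
PROTON = 1.007276                     # proton mass (Da)

reaction_pairs = {                    # tiny hand-made network: metabolite -> reaction-paired neighbours
    "glucose": ["glucose-6-phosphate"],
    "glucose-6-phosphate": ["fructose-6-phosphate", "glucose"],
    "fructose-6-phosphate": ["glucose-6-phosphate"],
}
neutral_mass = {"glucose": 180.06339,
                "glucose-6-phosphate": 260.02972,
                "fructose-6-phosphate": 260.02972}

features = {"F1": 181.0707, "F2": 261.0370, "F3": 300.1234}   # observed [M+H]+ m/z values
annotations = {"F1": "glucose"}                               # confidently identified "seed"

changed = True
while changed:                        # propagate until no new annotation is added
    changed = False
    for feat, mz in features.items():
        if feat in annotations:
            continue
        for seed in list(annotations.values()):
            for neighbour in reaction_pairs.get(seed, []):
                expected_mz = neutral_mass[neighbour] + PROTON
                if abs(mz - expected_mz) / expected_mz * 1e6 <= 10:   # 10 ppm tolerance
                    annotations[feat] = neighbour                     # candidate annotation only
                    changed = True
                    break
            if feat in annotations:
                break

print(annotations)   # F2 is proposed as a hexose phosphate via its link to the glucose seed
```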

#### 2.7.2. Metabolic Networking to Visualize and Interpret Metabolite Changes

In general, changes at the metabolite level cannot be considered in isolation, outside the context of their interactions with other metabolites, proteins, and genes, and meaningful changes can be missed by relying only on an arbitrary significance threshold (or *p*-value). It is thus of the utmost importance to interpret identified alterations at the metabolite level within metabolic networks, especially when it comes to the discovery and understanding of subtle (fold change < 2) but coordinated and physiologically relevant changes, which are often the case in biomedical and human population studies. Metabolic networks, derived from genome-scale metabolic network models (GSMNM), are the most accurate way to describe and represent metabolism, as compared to discrete pathways [122]. Multiple metabolic pathways share metabolites, and the synthesis of one metabolite can require the integrated cooperation of more than one pathway. A GSMNM reconstructed from annotated gene–protein reaction (GPR) associations can define the metabolic capacity of a model organism in any specified condition. While the primary metabolic pathways are highly conserved across model organisms, they can be differentially regulated, in an organism-specific manner, as a function of genetic effects (i.e., mutations in different genotypes) and environmental exposures. Efforts are needed for systematic characterization of model organism metabolomes (across different conditions, using quantitative information), and to develop compartmentalized models for different organs and host–microbiome metabolic interactions [123–126].

To interpret data from metabolomics experiments and gather biologically meaningful information, one would ideally perform two types of analysis: (1) mapping and visualization of metabolite changes in a graphical representation of cell metabolism, i.e., the metabolic network; and (2) statistical analysis to determine the overrepresented pathways, known as metabolite set enrichment analysis (MSEA). Most of the open access tools designed for pathway and network analysis provide both of these functionalities: visualization, to assess whether metabolites are involved in the same pathways and how they are connected within a metabolic network; and enrichment analysis, to highlight the pathways associated with the examined phenotype. The open access software that provide these functionalities in an interactive fashion are listed in Table 4. For the computational community, the recently assembled MetaRbolomics toolbox provides an extensive overview of R packages that can be used for data processing, metabolite annotation, and biochemical network and pathway analysis [127].

**Table 4.** List of selected open access web servers for interactive pathway visualization, metabolite mapping, and visualization in the context of pathways and metabolic networks, and metabolite set enrichment and overrepresentation analysis (MSEA, ORA).


**\*** Only features relevant to pathway and network analysis are listed here; the MetaboAnalyst and XCMS Online servers provide many other functionalities related to data processing and analysis.

In order to map the identified metabolite changes in a biochemically relevant context, one first needs to convert the metabolite identities into the relevant metabolite identifiers (e.g., KEGG, HMDB, Recon) that can be used for mapping to metabolic networks derived from genome-scale models (the product of genome sequencing, annotation, and, finally, metabolic model reconstruction). The conversion to different metabolite identifiers can be executed in batches using a chemical translation service, such as the one provided by UC Davis [134] or MetaboAnalyst [91]. Users should be aware that a portion of identifiers may be missing and/or incorrectly matched (approximately 10%), so manual curation may be necessary before uploading to pathway or network analysis tools for further analysis and visualization. Metabolite mapping would ideally be based on InChIs or InChIKeys, which requires that these identifiers are specified in both databases and networks [126]. Lipids pose a particular challenge here, because identification at the sum composition level (e.g., PC 34:2) can correspond to many similar lipid species with different fatty acid compositions (16:1/18:1, 16:0/18:2, etc.) [132].
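As an illustration, the name-to-identifier conversion can be scripted; the sketch below uses the compound mapping functions documented in the MetaboAnalystR enrichment workflow, with illustrative compound names, and the exact arguments should be treated as assumptions to verify against the package vignettes.

```r
# Batch conversion of compound names to database identifiers (KEGG, HMDB, PubChem, ...)
# using MetaboAnalystR's compound mapping; argument values are illustrative assumptions.
library(MetaboAnalystR)

cmpd.vec <- c("Glutamate", "Citrate", "Taurocholic acid")   # metabolites of interest

mSet <- InitDataObjects("conc", "msetora", FALSE)  # initialize an enrichment-analysis object
mSet <- Setup.MapData(mSet, cmpd.vec)              # register the compound list
mSet <- CrossReferencing(mSet, "name")             # map names against the internal library
mSet <- CreateMappingResultTable(mSet)             # table of matched KEGG/HMDB/PubChem IDs
# Unmatched or ambiguous entries should be curated manually before network mapping.
```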

Visualization of metabolite changes in the context of metabolic networks brings together the chemical reactions (of which metabolites are the products or substrates) and the genes coding for the enzymes that make these reactions possible. MetExplore is among the most comprehensive tools, allowing the construction of tailored networks and collaborative curation and annotation of metabolic models, in addition to interactive network visualization, from the entire network down to detailed sub-networks (built from selected network elements, such as a pathway or a set of genes) [128,135]. In addition to visual inspection, flux consistency is checked to validate the metabolic model (i.e., the network). MetExplore integrates a large panel of metabolic models (called "biosources" in MetExplore) depending on the model organism, and each metabolic network can be exported as an SBML or Excel file. Metabolites can be mapped using their network identifiers (KEGG, Recon, etc.), and smart filters can then be applied to select the reactions involved in a combination of pathways of interest (e.g., enriched pathways) to be visualized through the MetExploreViz web module (Figure 5). MSEA is integrated and performed using hypergeometric tests (corrected with the Bonferroni or Benjamini–Hochberg methods). Specific metabolites and pathways can be highlighted, edited, and exported, and the shortest paths between the metabolites of interest can be automatically extracted to reduce visual complexity, thus facilitating data mining.

**Figure 5.** Metabolite mapping on metabolic networks: an overview of MetExploreViz functionalities. The projected network was created from the list of chemical reactions (in the cart on the right side of the figure) derived from the list of identified metabolites whose levels varied significantly (as a result of brain cell profiling). The extent of each pathway has been encircled and colored for visualization. Alanine, aspartate and glutamate metabolism, and arginine biosynthesis have been highlighted as enriched (using the integrated ORA).

As mentioned above, in addition to metabolic (sub)network visualization, metabolite set enrichment analysis (MSEA), the metabolomic counterpart of gene set enrichment analysis (GSEA), and/or over-representation analysis (ORA) are used to investigate the metabolic pathways whose activity differs among the analyzed conditions (e.g., CTRL vs. disease). MSEA takes into consideration the quantitative measure associated with each metabolite (i.e., abundance or concentration, and fold change) [121]. MSEA first assigns metabolites to pre-defined groups of functionally related metabolites (or metabolite sets) based on reference databases (e.g., KEGG, HMDB; Table 5). The metabolite sets can be defined as biochemical or signaling pathways (i.e., metabolites involved in the same biological process), pathways associated with a metabolic disease (i.e., metabolites that vary significantly under the same pathological conditions, suggested by HMDB [136,137]), pathways active in specific organs, tissues, or organelles (i.e., metabolites present in the same location, suggested by HMDB [137]), etc. MSEA then applies *Globaltest* [138] to detect the subtle but consistent and coordinated changes (i.e., differences) within a group of metabolites (i.e., a pathway) between two conditions, and thus identifies the affected (or deregulated) biochemical pathways associated with the analyzed outcome or phenotype [91,121]. The obtained *p*-value gives the probability that none of the matched compounds in the group of metabolites is associated with the phenotype. A closely related approach to MSEA, over-representation analysis (ORA), is used to evaluate the probability that a particular set of metabolites (e.g., a biochemical pathway) is represented, within a defined list of metabolites of interest, more often than expected by random chance. For ORA, a user can provide only a list of metabolite identifiers corresponding to the metabolites that vary significantly between the two analyzed conditions. Several probability tests, such as Fisher's exact test, the binomial probability, or the hypergeometric distribution test, can be applied, followed by correction for multiple testing. Here, the reference metabolome should comprise the metabolite sets that can be detected under the analytical conditions used, thus reflecting the analytical method coverage. If the entire library of metabolite sets is used for ORA by default, the observed enrichment may be a consequence of analytical platform bias rather than being biologically relevant. ORA and/or MSEA are integrated in many different pathway and/or metabolic network analysis software packages, such as MetaboAnalyst, MetExplore, and PathVisio. Finally, MetaboAnalyst also allows for combined MSEA and pathway topology analysis, which displays pathway impact values based on a centrality measure (a local quantitative measure of the position of a node, or its "key" role, relative to the other nodes in the network).
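To make the ORA calculation concrete, the short sketch below computes a one-sided hypergeometric (equivalently, Fisher's exact) enrichment *p*-value for a single hypothetical pathway. All counts are illustrative, and, as stressed above, the background (reference) metabolome should be restricted to what the analytical method can actually detect.

```r
# Over-representation analysis (ORA) for one pathway via the hypergeometric test (mock counts).
N <- 800   # reference metabolome: metabolites detectable under the analytical conditions used
K <- 25    # members of the pathway present in the reference metabolome
n <- 60    # metabolites of interest (significantly changed between the two conditions)
k <- 8     # metabolites of interest that belong to the pathway

# P(X >= k): probability of observing at least k pathway members by chance
p_hyper <- phyper(k - 1, K, N - K, n, lower.tail = FALSE)

# Equivalent one-sided Fisher's exact test on the 2x2 contingency table
tab <- matrix(c(k, n - k, K - k, (N - K) - (n - k)), nrow = 2)
p_fisher <- fisher.test(tab, alternative = "greater")$p.value

c(hypergeometric = p_hyper, fisher = p_fisher)
# p-values obtained for many pathways are then corrected for multiple testing, e.g.
# p.adjust(p_vector, method = "BH")
```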

Although the tools for metabolic network analysis are being steadily improved by the computational community, a number of challenges remain, related to the metabolome coverage bias of the experiment (i.e., analytical limitations), the scarcity of well-annotated metabolomics data (the number of unknown or non-annotated metabolites (features) remains high), and, finally, the lack of knowledge about network regulation. It is also important to consider that the metabolome cannot be computed directly from the genome, and that many metabolites still need to be integrated into current metabolic networks, thus making use of the wealth of data generated in metabolomic experiments.

**Table 5.** List of open access knowledge databases (used in the above listed web servers). Some databases have been extended into pathway browsers for interactive metabolite mapping. Although some databases are gene-centric, all of them are searchable for metabolites and represent a great source of biochemical knowledge for metabolite data interpretation.


#### *2.8. From Untargeted to Targeted Assays*

Global or untargeted metabolomics provides the opportunity for biomarker discovery and hypothesis generation. Potentially, it can enable the elucidation of the involvement of previously unknown or unsuspected pathways in disease states or in response to therapy. Inherently, this untargeted approach does not bias for or against specific analyte classes and provides a wide view of the metabolome. Sample preparation and analytical methods are somewhat generic and are usually optimized for the sample type. However, with this approach comes the bottleneck of metabolite feature annotation and metabolite identification, as described in this review. Therefore, high-coverage targeted assays are becoming more prominent in the field of metabolomics. With targeted assays, tandem or triple quadrupole mass spectrometers are employed, with lower mass resolution than the Orbitrap or Q-ToF mass spectrometers used for untargeted analyses. However, these have the advantages of lower cost and higher sensitivity, linearity, and specificity. By employing isotopically labelled standards of the analytes of interest, which are spiked into each study sample, absolute quantification can be achieved. Furthermore, as the analytes being measured are known upfront, and the chromatographic and mass spectrometric methods are optimized at the start, run times can be much shorter than for untargeted analyses. There are guidelines which can be followed to ensure the accuracy and precision of the assay, such as those laid out by the FDA [146]. Software exists for the analysis of targeted data, either vendor-provided or freeware such as Skyline [147]. It is likely that as this field of research advances, more targeted assays will be incorporated into the metabolomics workflow.

#### **3. Conclusions**

Untargeted metabolomics is a powerful approach to understanding changes due to disease, drug treatment, or environmental factors in a multitude of human, animal, and cell models. However, as metabolism is complex, so are the data produced in these studies. It is therefore crucial to be vigilant at every stage of the experiment. If the study has not been designed correctly, it will be hard to elucidate biologically relevant information, as confounding factors may overwhelm any biological changes. To maximize the metabolome coverage, it is necessary to acquire data in several chromatographic and ionization modes, ideally HILIC for polar metabolites and RPLC for complex lipids (using non-polar solvents for elution). MS/MS data of high quality and volume can be acquired in DDA mode using iterative injections with PC-driven exclusion and/or in DIA mode with sequential mass windows (e.g., SWATH, SONAR). Furthermore, it is of the utmost importance to pre-process the data correctly, as there will inherently be redundancy in the data. As metabolite identification remains the bottleneck in metabolomics studies, stringent approaches are needed to ensure that models have been validated and only the strongest candidates are pursued through the identification pipeline. The comprehensiveness of experimentally generated and in silico-derived spectral databases has grown significantly, and their integration into the data processing workflow, together with the improvement of computational approaches (for in silico prediction of MS/MS data), is paving the way towards automated MS/MS data matching to facilitate metabolite annotation. Finally, the advancements in metabolic network analysis tools are enabling more mechanistic insights, beyond biomarker discovery. Here, metabolite data provide crucial complementary information on "what has indeed happened", as the phenotype readout at the molecular level, thus representing the "missing piece" of the puzzle towards multi-scale omics data integration for more accurate interpretation of biological processes.

**Author Contributions:** J.I. and E.J.W. contributed equally to the preparation of this manuscript.

**Funding:** This research received no external funding.

**Acknowledgments:** The authors acknowledge all of the members of their respective teams, from the University of Lausanne and Imperial College London, as well as their collaborators (fundamental scientists and clinicians), for their support and interaction during many years of involvement with metabolomics.

**Conflicts of Interest:** The authors declare no conflicts of interest.

#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **MetaboAnalystR 2.0: From Raw Spectra to Biological Insights**

#### **Jasmine Chong 1, Mai Yamamoto 1 and Jianguo Xia 1,2,\***


Received: 5 March 2019; Accepted: 21 March 2019; Published: 22 March 2019

**Abstract:** Global metabolomics based on high-resolution liquid chromatography mass spectrometry (LC-MS) has been increasingly employed in recent large-scale multi-omics studies. Processing and interpretation of these complex metabolomics datasets have become a key challenge in current computational metabolomics. Here, we introduce MetaboAnalystR 2.0 for comprehensive LC-MS data processing, statistical analysis, and functional interpretation. Compared to the previous version, this new release seamlessly integrates XCMS and CAMERA to support raw spectral processing and peak annotation, and also features high-performance implementations of mummichog and GSEA approaches for predictions of pathway activities. The application and utility of the MetaboAnalystR 2.0 workflow were demonstrated using a synthetic benchmark dataset and a clinical dataset. In summary, MetaboAnalystR 2.0 offers a unified and flexible workflow that enables end-to-end analysis of LC-MS metabolomics data within the open-source R environment.

**Keywords:** global metabolomics; LC-MS; spectra processing; pathway analysis; enrichment analysis

#### **1. Introduction**

Metabolomics is the comprehensive study of all small molecule metabolites (<1500 Da) detected within a biological system. An individual's metabolic profile represents the functional product of interactions among genetics, lifestyle, environment, diet, and native microbiota, which closely reflects his or her health status [1,2]. The metabolome thus serves as the link between genotype and phenotype, and metabolomics will play a critical role in the development and implementation of precision medicine [3,4].

There are two general approaches to conducting metabolomics. Targeted metabolomics aims to study a predefined set of metabolites and requires familiarity with the system [3]. Untargeted metabolomics, also known as global metabolomics, aims to measure the global set of metabolites within a sample without prior knowledge of the system. A typical metabolomics analysis workflow involves three main steps: raw data processing, statistical analysis, and functional interpretation (Figure 1). Global metabolomics requires more sensitive analytical platforms to achieve comprehensive measurement. High-resolution liquid chromatography–mass spectrometry (LC-MS) is currently the main workhorse for global metabolomics. The platform often generates thousands of signals, including true biological signals from metabolites, their adducts, fragments, and isotopes, as well as noise signals from contaminants and artifacts [5]. Computational tools able to significantly reduce noise in MS spectra are crucial for more meaningful downstream analyses [6].

There are several powerful computational workflows, including commercial tools such as Mass Profiler (Agilent Technologies) and Compound Discoverer (Thermo Scientific), cloud-based software such as XCMS Online [7] and Workflow4Metabolomics [8], desktop software such as MZmine2 [9], MS-DIAL [10], and Open-MS [11], and finally R packages such as MAIT [12] and metaX [13]. Most of these software tools focus on addressing one of the two main tasks: spectral processing or statistical analysis. Consequently, users must often learn several tools to meet their data analysis needs. Due to compatibility issues, users often have to write scripts to convert outputs from one tool in order to use another tool.

**Figure 1.** A typical metabolomics data analysis workflow including raw data processing, statistical analysis and functional interpretation.

Tools for the functional interpretation of global metabolomics data are in general lacking or poorly developed [14,15]. A prerequisite for metabolomics data interpretation is metabolite identification, which permits the contextualization of annotated peaks in metabolic pathways and their integration with other omics data. However, even with the high mass accuracy afforded by current high-resolution MS platforms, it is often impossible to uniquely identify a given peak based on its mass alone [16]. Researchers usually need to manually search compound databases and then perform further experimental validation such as tandem MS. Novel bioinformatics tools are urgently needed to enable researchers to gain biological insights with a minimum amount of manual effort. To get around this bottleneck, a key concept is to shift the unit of analysis from individual compounds to individual pathways or groups of functionally related compounds (i.e., metabolite sets [17]). The general assumption is that the collective behavior of a group is more robust against a certain degree of random error in its individual members. The mummichog algorithm is the first implementation of this concept to infer pathway activities from ranked MS peaks [18]. The original algorithm implements an over-representation analysis (ORA) method to evaluate pathway-level enrichment based on significant peaks. An alternative approach is the Gene Set Enrichment Analysis (GSEA) method, which is widely used to test enriched functions from ranked gene lists [19]. Unlike ORA, GSEA considers the overall ranks of features without using a significance cutoff. It can detect subtle and consistent changes which could be missed when using ORA methods. Despite its widespread application in gene expression profiling, it has not yet been applied to global metabolomics.

MetaboAnalyst is one of the most widely used tools for statistical and functional analysis of metabolomics data [20–23]. It was initially designed for targeted metabolomics, and subsequent releases gradually introduced many statistical methods applicable to both targeted and untargeted metabolomics. Due to its web-based implementation, there is very limited support for raw spectra processing and peak annotation. The most recent update (version 4.0) was released with a companion R package, MetaboAnalystR (v1.0), to help tackle issues associated with workflow customization, reproducibility, and handling large datasets [24].

Here, we present MetaboAnalystR (v2.0) to address the two important gaps left in its previous version: (1) raw spectral processing - we have implemented comprehensive support for raw LC-MS spectral data processing including peak picking, peak alignment, and peak annotations; and (2) functional interpretation directly from *m*/*z* peaks - in addition to an efficient implementation of the mummichog algorithm [18], we have added a new method to support pathway activity prediction based on the well-established GSEA algorithm [19]. We showcase the performance of these new functions through two case studies.

#### **2. Results**

MetaboAnalystR 2.0 consists of a series of flexible R functions that can take a variety of user-supplied data and parameters to perform end-to-end metabolomics data analysis. The source code is freely available at the GitHub repository (https://github.com/xia-lab/MetaboAnalystR). Detailed instructions, tutorials, troubleshooting tips, example datasets, and analyses discussed in this paper are also available in this repository.

To demonstrate the utility of the MetaboAnalystR 2.0 workflow, we present the results from two case studies: (i) a synthetic benchmark dataset to evaluate the raw MS spectra processing functions, with a focus on peak detection and quantification performance; and (ii) a clinical pediatric inflammatory bowel disease (IBD) dataset to showcase the overall workflow, with a focus on its capacity to provide biological insights. All R scripts to perform the entire metabolomics data analysis pipeline are available from the MetaboAnalystR GitHub repository under the section "Case Studies". The accompanying vignette ("The MetaboAnalystR 2.0 Workflow") provides a step-by-step tutorial demonstrating how to use MetaboAnalystR 2.0 to perform an end-to-end metabolomics data analysis on a subset of 12 of the 48 clinical IBD samples. This tutorial was created on a Dell XPS 9570 laptop running Ubuntu 16.04 with 16 GB of memory. The total running time of the tutorial was 14 min, averaging ~1.25 min per sample, using 6 cores in parallel and 10.5 GB of memory.

#### *2.1. Benchmark Case Study*

We first demonstrate the accuracy of the raw data preprocessing module using a benchmark dataset composed of a mixture of 1100 known compounds ranging in size from 100 to 1300 Da [25]. The original study used a targeted analysis to obtain its benchmark feature list, which we used as the ground truth to evaluate our workflow. As shown in Table 1, the original study detected 35,215 peaks using XCMS Online, with 820 classified as true features. Using the same data preprocessing parameters as published, MetaboAnalystR 2.0 detected 21,013 peaks in the benchmark data. Among them, 732 matched the true features based on *m*/*z* and retention time (10 ppm and 0.3 min RT tolerance). Next, we compared the number of accurately quantified true features using MetaboAnalystR 2.0 to those from the original manuscript using XCMS Online (Table 1). Features were considered accurately quantified if their fold changes had a <20% relative error compared to the benchmark data. MetaboAnalystR 2.0 accurately quantified 632 features and identified 45 truly discriminating features.
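For orientation, matching detected peaks to the benchmark list under the stated tolerances (10 ppm mass error, 0.3 min retention time) can be expressed in a few lines of R; this is a generic sketch with mock values, not the code used for the comparison above.

```r
# Match detected peaks to benchmark ("ground truth") features by m/z and retention time.
# Mock data; tolerances follow the criteria stated in the text (10 ppm, 0.3 min).
detected  <- data.frame(mz = c(180.0867, 204.0899, 255.2325), rt = c(5.02, 7.81, 12.40))
benchmark <- data.frame(mz = c(180.0866, 255.2330),           rt = c(5.10, 12.55))

match_features <- function(det, bench, ppm = 10, rt_tol = 0.3) {
  sapply(seq_len(nrow(det)), function(i) {
    ppm_err <- 1e6 * abs(det$mz[i] - bench$mz) / bench$mz
    any(ppm_err <= ppm & abs(det$rt[i] - bench$rt) <= rt_tol)
  })
}

detected$is_true_feature <- match_features(detected, benchmark)
sum(detected$is_true_feature)   # number of detected peaks matching benchmark features
```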


**Table 1.** Comparison of peak identification and quantification accuracies using the benchmark dataset between MetaboAnalystR 2.0 and the original manuscript using XCMS Online.

#### *2.2. IBD Case Study*

The 48 fecal samples were obtained from 24 pediatric Crohn's Disease (CD) patients and 24 pediatric healthy controls (Table S1). Our workflow detected 8187 features which were further reduced to 6930 features after filtering out isotopes and features missing in >50% of samples. After exclusion of low-variance features, a total of 4113 features were analyzed using the standard MetaboAnalystR functions.

Mann–Whitney U tests and fold change analysis detected 59 features that were significantly different between CD patients and healthy controls. Differences between CD and healthy controls were evaluated using PCA, PLS-DA, and OPLS-DA. The PCA showed overlapping clusters along the first two components, with CD exhibiting a wider data distribution (Figure S1). This indicates an overall similarity of the metabolic profiles between CD and healthy controls but larger heterogeneity within CD patients. The PLS-DA score plot showed a clear separation between the two groups (Figure S2). Ten-fold cross validation of the two-component PLS-DA model gave an *R2* of 0.912 and a *Q2* of 0.424 (Figure S3). The OPLS-DA score plot shows a clear separation between CD and healthy controls (Figure 2), with an R2Y of 0.979 and a Q2 of 0.522. To further evaluate the model, we performed permutation tests (*n* = 1000). The empirical *p* values were 0.026 for R2Y and <0.001 for Q2. Altogether, a clear distinction between the metabolomes of CD patients and healthy controls was observed.

**Figure 2.** The OPLS-DA score plot based on the stool metabolome of 24 pediatric Crohn's disease patients and 24 healthy children.

To gain potential biological insights from the global metabolomics data, we applied both mummichog and GSEA algorithms and integrated their results (Figure 3). Mummichog suggested that differentially abundant features between CD and healthy patients were associated with perturbations in bile acid biosynthesis and fatty acid activation, as well as vitamin E, fatty acid, and vitamin D3 metabolism. The GSEA algorithm also identified alterations in bile acid biosynthesis. Moreover, it identified differences in androgen and estrogen biosynthesis and metabolism, squalene and cholesterol biosynthesis, biopterin metabolism, and butyrate metabolism. More details of the top 5 enriched pathways from both methods are given in Table 2.

**Figure 3.** The scatter plot integrating GSEA (x-axis) and mummichog (y-axis) pathway analysis results. The size and color of the circles correspond to their transformed combined *p*-values. The blue and pink areas highlight significant pathways based on either GSEA (pink) or mummichog (blue).

**Table 2.** The top five enriched metabolic pathways identified using the mummichog algorithm (*PerformMummichog*) and GSEA (*PerformGSEA*) in MetaboAnalystR 2.0.


\* The mummichog compound hits represent the number of significant compounds divided by the total number of compound hits per pathway.

Interestingly, the GSEA algorithm identified Butyrate metabolism as a significantly enriched pathway, whereas the mummichog algorithm did not. Further inspection (Figure S4) indicated that the mummichog algorithm utilized only the three significant *m*/*z* features to calculate the enrichment score, whereas GSEA utilized all 20 compound hits (corresponding to 38 *m*/*z* features). Of these features, 145.04962 *m*/*z* was putatively annotated as (S)-2-Aceto-2-hydroxybutanoate (a deprotonated ion), as was 205.07102 *m*/*z* (a formic acid adduct). Furthermore, 124.03917 *m*/*z* corresponded to 2-Butynoate. This demonstrates the ability of GSEA to pick up subtle changes, such as perturbations in Butyrate metabolism, and the utility of using both algorithms to gain biological insights.

We further examined the 17 features that overlap between the putatively annotated features in the pathway analysis and the important features found in univariate statistical analysis. Notably, 431.3164 *m*/*z* was putatively annotated as a deprotonated ion of 3-β, 7-α-dihydroxy-5-cholestenoate based on its correspondence to the exact mass of C17336 from the KEGG database [26]. This compound is found in the primary bile acid pathway. Additionally, the same mass also corresponds to a deprotonated ion of 23S, 25, 26-trihydroxyvitamin D3 (CE2202). Exact identification of this feature requires further experiments, which is beyond the scope of this manuscript. In addition to this compound, five additional compounds out of the 17 have been previously found as stool metabolites in the context of IBD [27]. Representative EICs, boxplots, and corresponding information, such as *m*/*z*, retention time, and *p*-values, are highlighted in the Supplemental Materials (Figure S5).

#### **3. Discussion**

In this paper, we have described the new functions introduced in MetaboAnalystR 2.0 to support global metabolomics data analysis, covering raw LC-MS spectra processing to generation of biological insights. These functions were showcased through two case studies.

For the benchmark dataset, despite applying the same parameters used by Li et al. [25], we were unable to reproduce the identification and quantification performance obtained by the original authors using XCMS Online. Their setup detected >14,000 (68%) more features compared to those obtained using our pipeline. We tried several options, including the suggested parameters for an HPLC or UPLC coupled with a Q Exactive HF mass spectrometer. We posit this incongruity arose because the authors did not specify the exact peak width used, which is a critical parameter for peak picking. Additionally, the data conversion step from .RAW to mzML used in our workflow may have resulted in a slight difference in the input data compared to the data conversion used in XCMS Online. It is also important to note that our workflow integrated the latest version of XCMS (version 3.4.4), which has introduced many new functionalities and updates to existing functions. Overall, our preprocessing workflow performed well, executing peak picking, annotation, and filtering on the eight benchmark samples in less than twenty minutes.

For the IBD case study, we observed a clear separation in the metabolomic profiles between pediatric CD patients and healthy controls using either PLS-DA or OPLS-DA. Furthermore, our analysis highlighted several metabolic pathways associated with CD without performing accurate metabolite identification. For instance, alterations in bile acid biosynthesis and short-chain fatty acid metabolism are well known among IBD patients [28,29]. Combining the results of pathway analysis and statistical analysis also putatively identified some promising metabolic features that could be used as potential biomarkers. In addition to bile acids, vitamin D has been shown to play an immunomodulatory role in IBD pathogenesis [30]. Taken together, this use case demonstrates the ease with which MetaboAnalystR 2.0 can be utilized to gain mechanistic insights and generate hypotheses for future experimental validation.

MetaboAnalystR 2.0 has addressed the need for high-throughput raw spectra processing and for inferring pathway dysregulation directly from high-resolution MS1 data. A future direction of our workflow includes the integration of MS2 data to support targeted annotation of important peaks assigned to pathways of interest. This functionality will be developed in coordination with the MetaboAnalyst web server to provide online visual analytics support for molecular networking [31].

#### **4. Conclusions**

The previous version (v1.0) of MetaboAnalystR features comprehensive normalization and statistical methods inherited from the MetaboAnalyst web server. Version 2.0 not only integrates XCMS and CAMERA to support raw MS spectral processing and peak annotation, but also implements the mummichog and GSEA methods for prediction of pathway activities. The performance of this workflow was evaluated on a published benchmark dataset as well as a recent clinical study on IBD. The MetaboAnalystR package is maintained in conjunction with the cloud-based MetaboAnalyst web application and is under continuous development based on community feedback. Our next focus is on integration of MS2 data as well as development of a Galaxy-based platform for raw data processing [32].

#### **5. Materials and Methods**

#### *5.1. Spectral Processing*

Three main wrapper functions have been implemented for metabolomics data processing based on XCMS (version 3.4.4) and CAMERA (version 1.38.1) [33–35], including: (i) the *ImportRawMSData* function for reading in raw data files, (ii) the *PerformPeakProfiling* function for peak picking and alignment, and (iii) the *PerformPeakAnnotation* function for annotating isotopes and adducts in processed *m*/*z* data. These functions are described in further detail below.

The *ImportRawMSData* function reads in raw MS data files and saves them as an *OnDiskMSnExp* object. To avoid potential memory issues on a user's desktop/laptop, the function limits the number of cores used to half of those available. The function outputs two plots: the Total Ion Chromatogram (TIC), which provides an overview of the entire spectra, and the Base Peak Chromatogram (BPC), which is a cleaner profile of the spectra based on the most abundant signals. These plots are useful to inform the setting of parameters downstream. For users who wish to view a peak of interest, an Extracted Ion Chromatogram (EIC) can be generated using the *PlotEIC* function.

The *PerformPeakProfiling* function is a wrapper of several XCMS R functions that performs peak detection, alignment, and grouping in a single step. The resulting peaks are output as an *XCMSnExp* object. The function also generates two diagnostic plots: a retention time adjustment map and a PCA plot showing the overall sample clustering prior to data cleaning and statistical analysis. Users can specify several parameters, such as the mass accuracy, peak width, and retention time range, using the *SetPeakParam* function to optimize peak picking. A detailed table of suggested parameters for common LC-MS platforms is provided in Table S2.

The *PerformPeakAnnotation* function annotates isotope and adduct peaks using the CAMERA package [35]. CAMERA matches *m*/*z* features to potential isotopes and adducts based on mass using a dynamic rule set. It does not utilize any spectral databases to perform annotation. It outputs the result as a CSV file ("annotated\_peaklist.csv") and saves the annotated peaks as an *xsAnnotate* object. Finally, the peak list is formatted to the correct structure for MetaboAnalystR and filtered according to the user's specifications using the *FormatPeakList* function. This function permits the filtering of adducts (i.e., removal of all adducts except [M + H]+/[M − H]−) and the filtering of isotopes (i.e., removal of all isotopes except monoisotopic peaks). The goal of filtering peaks is to remove degenerate signals and to reduce the file size.
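For orientation, a minimal sketch of this processing sequence is shown below. The function names are those described in this section, but the argument names and values are assumptions chosen to mirror the parameters discussed in this paper; consult the package tutorials on GitHub for the exact interface.

```r
# Minimal sketch of the MetaboAnalystR 2.0 raw spectra processing sequence described above.
# Argument names/values are illustrative assumptions; see the package vignettes for details.
library(MetaboAnalystR)

raw_data <- ImportRawMSData("path/to/mzML_files/")        # read raw files, plot TIC/BPC

peak_params <- SetPeakParam(ppm = 5, min_peakwidth = 10,  # assumed settings for an
                            max_peakwidth = 60)           # HPLC-Orbitrap acquisition

peaks     <- PerformPeakProfiling(raw_data, peak_params)  # peak picking, alignment, grouping
annotated <- PerformPeakAnnotation(peaks)                 # CAMERA-based isotope/adduct annotation

# Keep monoisotopic peaks and [M+H]+/[M-H]- adducts only, and write the MetaboAnalystR table
peak_table <- FormatPeakList(annotated, filtIso = TRUE, filtAdducts = TRUE)
```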

#### *5.2. Prediction of Pathway Activities*

Several metabolic databases are currently supported, including KEGG [26] and BioCyc [36]. The main mummichog algorithm is available in the *PerformMummichog* function. Users need to specify a pre-defined cutoff based on either t-statistics or fold changes. The *PerformGSEA* function contains the GSEA implementation based on the high-performance *fgsea* R package [37].
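A minimal sketch of how these two functions might be invoked on a processed peak table is given below; the argument names, and the library and cutoff values (taken from the IBD case study settings reported in the Methods), are assumptions to be checked against the package documentation.

```r
# Sketch of pathway activity prediction in MetaboAnalystR 2.0 (argument names are assumptions).
library(MetaboAnalystR)

# 'mSet' is assumed to already hold the m/z peak table with per-feature p-values and t-statistics.
mSet <- PerformMummichog(mSet,
                         lib = "hsa_mfn",      # human BiGG/Edinburgh model pathway library
                         pval.cutoff = 0.05,   # significance cutoff for peaks (mummichog only)
                         instr.ppm = 5)        # instrument mass accuracy

mSet <- PerformGSEA(mSet, lib = "hsa_mfn")     # rank-based enrichment; no significance cutoff
```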

#### *5.3. Benchmark Case Studies*

The benchmark data created by Li et al. 2018 [25] comprise two standard mixtures (A and B) consisting of 1100 known compounds, with four replicates per mixture. The link to this raw dataset is available in Table S3. For this manuscript, we selected the dataset that was generated on a Q Exactive HF mass spectrometer (Thermo Fisher Scientific) in positive ion mode, coupled with a Dionex UltiMate 3000 HPLC equipped with a ZORBAX Eclipse Plus C18 column (Agilent Technologies). Parameters for our workflow were selected based on the default values provided for HPLC-Q Exactive Orbitrap data on XCMS Online (mass error: 5 ppm; peak width: 10–60 s).

The second dataset consists of pediatric IBD stool samples obtained from the Integrative Human Microbiome Project Consortium (iHMP) [38]. The original study included samples longitudinally collected from IBD patients and non-IBD controls over 50 weeks. The link to this raw dataset is provided in Table S3. For our evaluation purposes, we selected samples that met the following criteria for the diseased group: (i) age between 6 and 19, and (ii) diagnosed with Crohn's disease. Samples obtained at the earliest clinical visit of each patient who met criteria (i) and (ii) were included in our study. For the healthy controls, samples from non-IBD individuals aged between 6 and 19, collected during their first and second clinical visits, were included. The dataset was generated on a Q-Exactive Plus Orbitrap mass spectrometer (Thermo Fisher Scientific) in negative ion mode, coupled with a Nexera X2 U-HPLC system (Shimadzu Scientific Instruments) equipped with an ACQUITY BEH C18 column (Waters).

All raw data in .RAW format were converted into .mzML format using ProteoWizard 3.0 MSConvert [39] with the parameters summarized in the Supplemental Materials (Table S4). Following the spectral processing described earlier, data cleaning and statistical analysis were performed on the clinical data using various functions within MetaboAnalystR. First, missing values were imputed by replacing them with half of the minimum value found for each feature. Features containing more than 50% missing values across all samples were excluded. Features with nearly constant values across samples were also filtered out based on the interquartile range (IQR), which removed approximately 25% of total features. Subsequently, the value of each feature was normalized to the median value of all features per sample to account for the variable water content of stool samples. Finally, generalized log-transformation and auto-scaling were applied to the data prior to multivariate statistical analysis. For univariate analysis, non-parametric methods (i.e., the Mann–Whitney U test and fold change calculation) were applied to untransformed data to avoid false positives due to data manipulation [40]. A fold change >2 or <0.5 and a false discovery rate (FDR)-adjusted *p*-value of 0.05 were used as cut-off values. To infer pathway activities, we applied both the mummichog and GSEA algorithms. The human BiGG and Edinburgh Model (hsa\_mfn) library was selected as the pathway database, with the *p*-value cutoff set to 0.05 and the instrument accuracy set to 5 ppm.
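The univariate step described above can be expressed generically in a few lines of base R; this is a mock-data sketch of the Mann–Whitney U test, fold change, and FDR filtering logic, not the MetaboAnalystR code itself.

```r
# Generic sketch of the univariate analysis described above (mock data, untransformed intensities).
set.seed(1)
intens <- matrix(rlnorm(100 * 48), nrow = 100,
                 dimnames = list(paste0("feat", 1:100), NULL))   # 100 features x 48 samples
group  <- factor(rep(c("CD", "Control"), each = 24))

# Mann-Whitney U test (wilcox.test) and fold change per feature, then FDR correction
pvals <- apply(intens, 1, function(x)
  wilcox.test(x[group == "CD"], x[group == "Control"])$p.value)
fc    <- apply(intens, 1, function(x)
  mean(x[group == "CD"]) / mean(x[group == "Control"]))
fdr   <- p.adjust(pvals, method = "fdr")

sig <- which(fdr < 0.05 & (fc > 2 | fc < 0.5))   # cut-offs used in the text
length(sig)                                      # number of significant features
```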

**Supplementary Materials:** The following are available online at http://www.mdpi.com/2218-1989/9/3/57/s1, Table S1: Characteristics of pediatric IBD patients and healthy controls included in this study; Table S2: Suggested peak picking parameters for commonly used LC-MS platforms; Table S3: Raw datasets used in the Case Studies; Table S4: Parameters used to convert .RAW files to mzML format on ProteoWizard MSConvert; Figure S1: PCA plot of pediatric IBD stool metabolome. Data including 4113 features were median-normalized, log-transformed, and auto-scaled; Figure S2: PLS-DA plot of pediatric IBD stool metabolome. Data including 4113 features were median-normalized, log-transformed, and auto-scaled; Figure S3: Ten-fold cross validation of PLS-DA model (Figure S3) generated from the pediatric IBD stool metabolome data; Figure S4: Boxplots of *m*/*z* features used for functional interpretation; Figure S5: Representative EICs and boxplots of compounds differentially excreted in stool samples of healthy children and pediatric CD patients based on pathway analysis and Mann–Whitney U test (FDR adjusted *p*-value < 0.05).

**Author Contributions:** Conceptualization, J.X.; Data curation, M.Y.; Formal analysis, J.C. and M.Y.; Funding acquisition, J.X.; Methodology, J.C., M.Y. and J.X.; Supervision, J.X.; Writing—original draft, J.C. and M.Y.; Writing—review & editing, J.X.

**Funding:** This research was funded by Genome Canada, Génome Québec, Natural Sciences and Engineering Research Council of Canada (NSERC), and Canada Research Chairs (CRC) Program.

**Acknowledgments:** We gratefully acknowledge the developers of XCMS and CAMERA, Steffen Neumann and Johannes Rainer, for their valuable contribution to the metabolomics community.

**Conflicts of Interest:** The authors declare no conflicts of interest.

#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Review* **Metabolic Modeling of Human Gut Microbiota on a Genome Scale: An Overview**

**Partho Sen 1,2,\* and Matej Orešič 1,2**


Received: 25 November 2018; Accepted: 24 January 2019; Published: 28 January 2019

**Abstract:** There is growing interest in the metabolic interplay between the gut microbiome and host metabolism. Taxonomic and functional profiling of the gut microbiome by next-generation sequencing (NGS) has unveiled substantial richness and diversity. However, the mechanisms underlying interactions between diet, gut microbiome and host metabolism are still poorly understood. Genome-scale metabolic modeling (GSMM) is an emerging approach that has been increasingly applied to infer diet–microbiome, microbe–microbe and host–microbe interactions under physiological conditions. GSMM can, for example, be applied to estimate the metabolic capabilities of microbes in the gut. Here, we discuss how meta-omics datasets such as shotgun metagenomics, can be processed and integrated to develop large-scale, condition-specific, personalized microbiota models in healthy and disease states. Furthermore, we summarize various tools and resources available for metagenomic data processing and GSMM, highlighting the experimental approaches needed to validate the model predictions.

**Keywords:** gut microbiome; meta-omics; metagenomics; metabolomics; metabolic reconstructions; genome-scale metabolic modeling; constraint-based modeling; flux balance; host–microbiome; metabolism

#### **1. Introduction**

The human gut microbiome consists of trillions of microorganisms such as bacteria, archaea, and unicellular eukaryotes [1,2]. Most gut microbes are facultative and obligate anaerobes spanning five different phyla (Bacteroidetes, Firmicutes, Proteobacteria, Verrucomicrobia, and Actinobacteria), with over 1000 species already identified [3]. Several collaborative studies and large consortia such as MetaHIT [4,5], the Human Microbiome Project (HMP) [6,7], and American Gut [8] have taxonomically and functionally profiled the gut microbiome in healthy and various disease states. The composition of the gut microbiota is relatively simple at birth; it then undergoes a series of changes in composition and metabolic function, eventually maturing between 3–5 years of age [9]. For any one individual, the composition of the gut microbiome tends to be stable over time. Interestingly, the composition of the gut microbiome differs across a human population [10–12]. Several genetic and environmental factors, such as diet, lifestyle, geography, mode of delivery, infection, infant feeding modality (e.g., formula versus breastfeeding), and medication, contribute to these differences and thereby shape the gut microbiota during the early stages of life [2,9,13].

The gut microbiome acts as an auxiliary metabolic organ. Several complex carbohydrates that are not digested by host intestinal enzymes are passed to the microbial community and are then metabolized in the large intestine [14,15]. The gut microbiota is involved in the metabolism of short-chain fatty acids (SCFAs), branched-chain fatty acids (BCFAs), branched-chain amino acids (BCAAs), biogenic amines, vitamins, bile acids (BAs), and xenobiotics, as well as the production of gases (e.g., CO2, CH4) [16–18]. Gut microbes also affect the host immune system, for example by regulating immune homeostasis versus autoimmunity [19]. Studies in germ-free mice suggest that the gut microbiota can induce toll-like receptor (TLR) expression, antigen-presenting cells (APCs), and differentiated CD4<sup>+</sup> T cells [20]. The gut microbiota also maintains the stability of the immune system by providing resistance against pathogens.

Our understanding of the gut microbiome and its role in health and disease has considerably improved with the advent of high-throughput meta-omics technologies. The wealth of data generated by gut microbiome research, however, demands the development of novel computational tools and mathematical models. Such tools have already enabled researchers to begin exploring the complexities of the gut microbiome (Table 1). Several approaches, such as 16S rRNA amplicon sequencing and whole genome shotgun metagenomic sequencing (WGS), have already been used for profiling gut microbes [21]. However, such genome-centric approaches are by themselves unable to provide mechanistic insights at the level of individual species, their interactions with other gut flora, and their impact on host metabolism [14,22,23].

Genome-scale metabolic modeling (GSMM), a constraint-based mathematical modeling approach, has been increasingly used to study gut ecosystems, attempting to elucidate microbial metabolic interactions with each other and with the host [15,24–26]. Recently, genome-scale models (GEMs) of catalogued human gut microbes [4,27], based on their metabolic functions, were developed. GEMs can integrate multiple types of biological information within a computational framework [28–31]. The complex interplay of genes, enzymes, and metabolites provides a scaffold for the integration of multi-omics datasets such as transcriptomics, proteomics, metagenomics, metabolomics, and fluxomics (Figure 1). A GEM framework allows researchers to decipher, postulate, and test hypotheses linking genotype to phenotype [28–30]. Overall, it provides a comprehensive systems biology platform for modeling and analyzing biological systems.

**Figure 1.** Overview of meta-omics profiling, annotation and genome-scale metabolic reconstructions. (**A**) Fecal, plasma and/or serum samples are taken from healthy and diseased subjects and meta-omics data is generated from these. (**B**) Taxonomic and functional profiling of gut microbes. (**C**) Reconstruction of microbial GEMs. Contextualization and personalization of GEMs with meta-omics datasets. (**D**) Summary of host-microbial interactions in the human gut. GEM simulations to study and understand the intricate relationship among diet, host and microbiota under healthy and disease states.

Herein, we review the role of GSMM in understanding microbial metabolism in the human gut, with a focus on how GEMs have been used to infer diet–microbiome, microbe–microbe and host–microbiome interactions under physiological conditions. We discuss metagenomics profiling, and how meta-omics datasets can be used for building condition-specific personalized community models of gut microbiota. We further summarize the available tools for metagenomic profiling and GSMM. Finally, we highlight and emphasize the experimental techniques and data required to validate the GEM-based predictions.

#### **2. Colonization and Shaping of the Gut Ecosystem**

Early colonization of the gut microbiota in infants is vital for the shaping of the intestinal ecosystem at a later age [2,32]. These processes are driven by multiple factors such as mode of delivery, gestational age, maternal diet, environment, and host genetics. Additionally, geography, lifestyle, age, certain diseases, and drug usage can all affect gut microbial composition and function [2,33].

The distribution of microbes along the gastrointestinal (GI) tract is non-random, in that certain species of microbes co-localize. *Lactobacillaceae*, *Veillonellaceae*, and *Helicobacteraceae* co-occur in the stomach, *Bacillaceae* and *Streptococcaceae* in the small intestine, and *Bacteroidaceae*, *Clostridium*, *Lactobacillaceae*, and *Bifidobacterium* in the colon [34]. Dysbiosis in the intestinal ecosystem has been both directly and indirectly linked to autoimmune diseases (e.g., type 1 diabetes (T1D), rheumatoid arthritis (RA)) [35,36], colon cancer [37], type 2 diabetes (T2D) and obesity [5,25], cardiovascular disorders [38], non-alcoholic fatty liver disease (NAFLD) [39,40], as well as inflammatory bowel disease (IBD) [41].

#### **3. Gut Microbiome Profiling and Functional Annotation**

Metagenomic shotgun sequencing [42] and 16S rRNA amplicon sequencing [43] have been used for profiling the gut microbiota from fecal (stool) samples. An appropriately annotated shotgun metagenomics dataset can be used for accurately mapping and predicting microbiota-affected metabolic pathways. These approaches also have proven potential for novel gene discovery [44] and the identification of essential functions. Annotation of metagenomics datasets is primarily carried out in two ways: (a) by assembling nucleotide sequences from NGS reads of appropriate length and subsequently predicting the protein-coding sequences (CDS) [45], and (b) by mapping the reads to genomes or non-redundant marker gene sets of the relevant organisms, guided by taxonomic profiling [46]. These genes can be clustered, catalogued, and aligned against reference database(s) of annotated gene/protein families (e.g., KEGG Orthology [47]), and/or they can be linked to metabolic pathways (e.g., MetaCyc [48]).

Various computational tools and pipelines have been developed for these purposes. MOCAT2, for example, provides automated annotation of non-redundant reference catalogues from 18 databases covering various functional categories [45]. The HMP Unified Metabolic Analysis Network (HUMAnN2) is a pipeline for profiling the relative abundances of microbes and the activity of their metabolic pathways from metagenomics data [46,49]. The MEtaGenome ANalyzer (MEGAN) is an interactive and comprehensive microbiome analysis toolbox that allows researchers to explore and analyze large-scale metagenomics datasets from both taxonomic and functional perspectives [50]. Metagenomics RAST (MG-RAST) is a RAST (Rapid Annotation using Subsystem Technology) server for automated annotation of metagenomics datasets [51]. Integrated Microbial Genomes & Microbiomes (IMG/M) is another server-based system that supports the annotation and analysis of microbiome datasets [52]. There is a plethora of tools for sequence assembly, gene prediction, and phylogenetic classification which underpin many of these processes, and these tools are extensively reviewed elsewhere [53].

Functional annotation of metagenomics datasets poses several challenges in itself [53,54]. Although metagenomics data categorize microbial functions at the community level, they fail to suggest a mechanistic explanation for how these functions arise. To understand the intricate relationships between microbial components, such as genes, proteins, and metabolites, and their influence on host metabolism via different biochemical pathways, microbe-specific metabolic models need to be developed at the genome scale.

#### **4. A Constraint-Based Strategy and Tools for Genome-Scale Metabolic Modeling of Gut Microbiota**

The rapid increase in the use of shotgun metagenomics, the availability of model organisms, and the number of meta-omics datasets in public repositories provide an opportunity to develop metabolic reconstructions of human gut microbes. These reconstructions can be converted into quantitative mathematical models that can be used to study metabolism at the genome scale [28,55–58]. Current tools and resources for gut microbiome modeling are listed in Table 1.


**Table 1.** Tools and resources for genome-scale metabolic modeling.

In a GEM, the uptake or secretion of certain metabolites over time (denoted as their 'flux'), enzyme/transcript abundances, and ON/OFF gene expression can be constrained using information from datasets generated by quantitative fluxomic, metabolomic, transcriptomic, and proteomic experiments. By applying these constraints, GEMs can be contextualized to a particular state or condition. These condition-specific/contextualized models can provide information about the activity of metabolic pathways, metabolite flux, and cellular growth, and provide estimates of the overall metabolic capacities of gut microbes. GSMM uses flux balance analysis (FBA) [28], a constraint-based approach (CBA), to predict organisms' phenotypes [28]. A tutorial on linear programming and FBA is available in [28].
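To illustrate what FBA computes, the toy example below solves a five-reaction network as a linear program directly with the lpSolve R package; real GEMs contain thousands of reactions and are handled with dedicated toolboxes (COBRA, RAVEN), so the network, bounds, and units here are illustrative assumptions only.

```r
# A toy flux balance analysis (FBA) problem solved as a linear program.
# Illustrative only: real GEMs are built and analyzed with dedicated toolboxes.
library(lpSolve)

# Stoichiometric matrix S (rows = metabolites A, B, C; columns = reactions):
#   R1: uptake -> A,  R2: A -> B,  R3: B -> biomass,  R4: B -> C,  R5: C -> secreted
S <- matrix(c( 1, -1,  0,  0,  0,   # A
               0,  1, -1, -1,  0,   # B
               0,  0,  0,  1, -1),  # C
            nrow = 3, byrow = TRUE)

objective <- c(0, 0, 1, 0, 0)              # maximize flux through the biomass reaction R3
const.mat <- rbind(S, c(1, 0, 0, 0, 0))    # steady state S v = 0, plus an uptake bound on R1
const.dir <- c(rep("=", 3), "<=")
const.rhs <- c(rep(0, 3), 10)              # nutrient uptake limited to 10 (arbitrary units)

sol <- lp("max", objective, const.mat, const.dir, const.rhs)  # all fluxes non-negative (irreversible)
sol$objval    # predicted growth (here 10, limited entirely by the uptake constraint)
sol$solution  # flux distribution across R1..R5
```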

GSMM has been applied to study gut microbial metabolism and its interactions with the host. Recently, AGORA (Assembly of Gut Organisms through Reconstruction and Analysis) was published, which carried out semi-automatic metabolic reconstruction of 773 human gut bacteria (205 genera, 605 species) [26]. The authors modeled metabolic interactions among microbial species based on their metabolic potential and the availability of nutrients. This approach identified and defined a growth medium for *Bacteroides caccae* ATCC 34185. Moreover, these metabolic reconstructions have been used to infer the metabolic diversity of microbial communities. The AGORA framework can be coupled with, for example, Recon 2, a generic reconstruction of human metabolism, which in turn can be used to study host–microbiome interactions. AGORA reconstructions are publicly available via the Virtual Metabolic Human (VMH) database [74] (https://vmh.life/). In addition, BiGG Models [73] (http://bigg.ucsd.edu/) and the Human Metabolic Atlas [76] (http://metabolicatlas.org/) are other open access knowledge bases for metabolic reconstructions.

KBase [63] (https://kbase.us/) and ModelSEED [75] (http://modelseed.org/) are web-based servers for the automatic reconstruction of microbial GEMs by integrating genome sequences and/or metagenomics datasets. The COnstraint-Based Reconstruction and Analysis (COBRA) [59–61] and RAVEN (Reconstruction, Analysis, and Visualization of Metabolic Networks) [62] toolboxes are stand-alone MATLAB software suites with collections of basic and advanced functions for genome-scale reconstruction and modeling. The Microbiome Modeling Toolbox [82] extends the functionality of the COBRA toolbox to use metagenomic data for modeling microbe–microbe/host–microbe metabolic interactions and for modeling personalized microbial communities. Draft GEMs generated by these platforms are then curated for the occurrence of genes, metabolites, reactions, and their associations based on evidence from the literature and expert knowledge of metabolism. Quality control checks, which are performed to eliminate false positives, also enhance the predictive power of GEMs [55].

#### **5. Reconstruction of Condition-Specific Personalized Gut Microbiota Models**

In a metabolic model, numerous genes and metabolites are associated by way of metabolic pathways deemed to be thermodynamically feasible. These models are formalized and applied over an entire microbiota community model [82]. Various efforts have already been made to integrate metagenomic data within a genome-scale framework [26,83]. However, approaches to integrate other kinds of meta-omics data are still in the early phases of development.

Shotgun metagenomics and 16S rRNA data have guided the selection of representative microbes (species or strains) in a community [24]. Integration of meta-omics datasets such as metatranscriptomics and metaproteomics, together with fecal metabolomics, into the microbiota metabolic modeling framework can constrain the model, improving the accuracy of its representation of the biological system. Moreover, meta-omics data can be applied to develop condition-specific microbiota models (Figure 1), such as metabolic reconstructions of the gut microbiota in lean vs. obese subjects. Likewise, a microbiota model can be personalized for an individual subject by combining metagenomics information with other phenomics datasets. Metagenomics, metatranscriptomics, and metaproteomics data can provide estimates of enzymatic and pathway activities in the gut [49], which approximate the metabolic activity in the gut of an individual under specified conditions.

Context-based, personalized microbiota models have already been used to study various conditions [28,55,56,61,84]. An array of analyses can be performed with these models. Flux Variability Analysis (FVA) [28,85] can estimate the maximal and minimal possible fluxes, and thus the flux span, for a specific metabolic exchange reaction of a specific microbial strain, pair of strains, or the community as a whole. It thereby determines the potential of a reaction to carry flux under the applied constraints/conditions. FVA can thus be used to compute strain-specific exchange fluxes for a particular metabolite, which can be compared with the net metabolite exchanges in the community. Moreover, it can evaluate the role of an individual microbe in metabolite production. The shadow price (SP) of a metabolite, on the other hand, indicates whether that metabolite is limiting for an optimal objective function (growth or biomass production) [28,61]. A negative SP suggests that flux through the objective function would increase with an increase in the concentration of that metabolite. As an example, SP analysis has already identified several microbial strains that decrease ursodeoxycholate (UDCA) biosynthesis by limiting its precursors [83].
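For orientation, these two analyses can be written in the standard constraint-based form; the notation below is a generic sketch and is not reproduced from the cited works (S is the stoichiometric matrix, v the flux vector, c the objective coefficients and Z* the optimal objective value):

$$\max_{v} / \min_{v} \; v_j \quad \text{subject to} \quad S v = 0, \quad v^{\min} \le v \le v^{\max}, \quad c^{\top} v \ge \gamma Z^{*},$$

which yields the feasible flux range (flux span) of reaction j while the objective is held at a fraction γ of its optimum. The shadow price of metabolite i is the corresponding dual value,

$$\mathrm{SP}_i = \frac{\partial Z^{*}}{\partial b_i},$$

i.e., the sensitivity of the optimal objective to a perturbation of the right-hand side b_i of the mass-balance constraint for metabolite i (zero at steady state). The sign convention, and hence the interpretation given above, depends on how metabolite availability is encoded in the model.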

Food metabolomics datasets detailing dietary constituents have been used to constrain the nutrient uptake rates of microbiota models [58]. Diet effectively serves as the feeding medium for the microbiome. Several in silico diets, such as a typical Western diet, a high fiber diet [26], an average European diet [26], breast milk [58], and Ready-to-Use Therapeutic Foods (RUTFs) [24], have been designed. The diet designer tool included in the aforementioned VMH database [74] can be used to calculate the range of dietary fluxes, given the metabolite concentrations. On the other hand, fecal, serum and plasma metabolomics data can be used to confirm the identity of microbial metabolites produced by the models [24,25].

#### **6. Modeling the Effect of Diet on Gut Microbiome**

Diet is the direct regulator of microbial metabolism in the gut ecosystem; dietary patterns have a profound effect on gut colonization and the shaping of the gut microbiome during the early stages of life [9]. Western diets are associated with a *Bacteroides* enterotype, whereas plant-based polysaccharides are associated with a *Prevotella* enterotype [86]. The three primary macronutrients, namely carbohydrates, proteins and fats, are known to have the greatest effect on gut microbial composition [18].

GSMM has already begun to be used to improve the mechanistic understanding of gut microbial metabolism and its dietary interactions [24–26]. Computational tools such as COMET [65], BacArena [64], dOptCom [68], MatNet [87], DyMMM [67], MCM [66], and CASINO [25] were designed to study diet–microbiome interactions. CASINO was able to predict the interactions along the diet–microbiota–host axis in 45 obese and overweight individuals [25]. Furthermore, this study estimated the metabolic capabilities of microbes in the lumen of obese and overweight individuals. The model predicted a significant change in amino acid and SCFA levels in response to dietary intervention, and the predictions were further validated by fecal and blood metabolomics data. In another study, GSMM was used to predict and elucidate the underlying interactions between *Bacteroides thetaiotaomicron*, *Eubacterium rectale* and *Methanobrevibacter smithii* when subjected to different gut ecosystems [15,22]. Recently, GEM-based predictions were used to evaluate the effect of RUTFs on the gut microbiome of healthy and malnourished children from Bangladesh and Malawi [24]. This methodology can be further extended to study the effect of health supplements, prebiotics and probiotics on the human gut microbiota.

#### **7. Multispecies Modeling and Interactions in the Gut Community**

Microbial species or strains with high abundance in samples are often selected for pairwise or community modeling [24,26]. Two or more microbial GEMs are joined via their extracellular compartments to build a community model [82]. The community model is linked to a "common compartment" mimicking the human gut, through which the exchange of metabolites takes place. A community biomass reaction, i.e., the sum of the biomasses estimated for each microbe, and coupling constraints are then added [82].
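Schematically, and using illustrative notation rather than the exact formulation of [82], the joint model optimizes a community biomass flux of the form

$$v_{\text{biomass}}^{\text{community}} = \sum_{k=1}^{n} v_{\text{biomass},k},$$

subject to the stacked mass-balance constraints of all n member GEMs, together with coupling constraints of the general form $|v_{i,k}| \le c \, v_{\text{biomass},k}$ (for a coupling factor c), which tie the flux through each reaction i of microbe k to that microbe's biomass production so that no organism can carry flux without growing.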

Pairwise analysis of microbes in a community has been used to determine their metabolic relationships under different types of diets [24,26,83]. However, in vitro screening of microbial pairs can be laborious and expensive. When gut microbes were subjected to Western and high fiber diets under aerobic and anaerobic conditions, pairwise modeling predicted six different types of interaction: competition, parasitism, amensalism, neutralism, commensalism and mutualism [26]. Furthermore, pairwise models developed from personalized gut microbiomes have been interrogated for single, cooperative, and community-wide bile acid production potential [83]. This strategy has identified several microbe pairs producing secondary BAs; for instance, *Bacteroides spp.* and *R. gnavus* can cooperatively produce UDCA [83]. In another study of the gut communities of healthy Bangladeshi and Malawian children, the rate of butyrate production was higher for pairs of microbes than for single species [24].

Alternatively, correlation-based co-occurrence networks built from abundant metagenomic species can be constructed [88,89]. Such a network can predict positive or negative associations between microbes. Microbe–microbe co-occurrence pairs of interest can then be selected and evaluated by in vitro co-culture experiments [90]. Interestingly, co-occurring species compete strongly for the metabolic resources required for cellular growth and maintenance. In this context, network analysis can be extended to incorporate metrics such as competition and complementarity indices, which can be used to further characterize and quantify the degree of metabolic interaction between selected pairs of microbes.
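As a hedged illustration of this idea (the object names, correlation threshold and use of the igraph package are assumptions of this sketch, not taken from the cited studies), a basic co-occurrence network could be assembled in R from a samples x species abundance matrix as follows:

```r
library(igraph)

# Spearman correlations between species abundance profiles.
rho <- cor(abundance, method = "spearman")
diag(rho) <- 0

# Keep only strong associations (threshold chosen purely for illustration).
adj <- (abs(rho) >= 0.6) * 1
g   <- graph_from_adjacency_matrix(adj, mode = "undirected", diag = FALSE)

# Label each edge as a positive or negative association.
E(g)$sign <- sign(rho[as_edgelist(g, names = FALSE)])
```

Edges with a positive sign then indicate putative co-occurrence, and negative edges putative mutual exclusion, which can be followed up with co-culture experiments.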

#### **8. Metabolic Modeling of Host–Microbiome Interactions**

Gut microbiota can harvest nutrients and energy from the diet. During these processes, small molecules (metabolites) are produced, some of which can be beneficial for the host and for microbial symbionts [16,18,84]. One such metabolite is butyrate, a bacterial fermentation product that fuels the colonic epithelium [22]; in fact, butyrate is the primary energy source for colonocytes. In mammals, the production of cresols from tyrosine has been linked to various species of *Clostridium*, *Bifidobacterium*, and *Bacteroides*, and altered 4-cresol levels in human urine have been associated with weight loss in IBD [17]. The primary conjugated BAs produced by the liver are deconjugated and biotransformed by gut microbes, affecting host signaling and metabolism [83]. BAs can also activate innate immune genes, which in turn alters the gut microbial composition; they also inhibit the growth of pathogens in the gut.

GEMs have been expanded to study metabolism in humans. Human generic metabolic reconstructions such as Recon 1 [91] and the Edinburgh Human Metabolic Network (EHMN) [92] were developed with a vision to integrate and analyze biological datasets. Similarly, Recon 2 [56,93], Recon 3D [94] and the Human Metabolic Reaction (HMR) database [95,96] were designed to capture human metabolism comprehensively. A metabolic reconstruction of human small intestinal epithelial cells (sIECs) was assembled and manually curated [97], and used to study the physiological functionality of the small intestine and its overall role in human metabolism. These models incorporate transporters present in the human gut [94,97,98], although some of them are only putatively identified. Furthermore, several functional cell- or tissue-specific GEMs have been generated for the liver [96], brain [99], adipocytes [95] and myocytes [100] using semi-automated approaches [101]. In addition, a gender-specific, whole-body metabolism (WBM) reconstruction was developed to capture and characterize the metabolism of 20 human organs [102]. A WBM framework can be constrained with dietary and physiological parameters and omics datasets. Such a framework was used to link gut microbiota-induced changes to organ-level metabolic processes in 149 subjects.

The Microbiome Modeling Toolbox [82], deployed under the COBRA suite, includes several functions for modeling complex metabolic interactions between the host and gut microbiota. It can integrate microbe (AGORA [26], BiGG [73]) and host (Recon [56,91,94]) metabolic reconstructions. Similarly, a common compartment mimicking the human gut is added, which enables pooling and exchange of metabolites between the microbes, lumen and the host cells.

In a different context, the microbiome-induced immune response is now well established. An imbalance in gut microbial composition has been linked to inflammatory and autoimmune diseases [103–106]. Various immune cells, including CD4+ effector T cells (particularly Th1, Th2, Th17 and iTreg), CD8+ cytotoxic T cells and macrophages, undergo metabolic reprogramming during proliferation and differentiation [107]. A macrophage (RAW 264.7 cell line) model was developed to study immunoactivation and immunosuppression [108]. Metabolic reconstructions of most other immune cells are currently unavailable. Developing GEMs for host immune cells [57] might allow us to study microbiome-mediated immunometabolic responses under various health and disease conditions.

#### **9. Model Predictions and Experimental Validation**

To establish the biological relevance of metabolic models, congruence between model predictions and experimental data is of utmost importance. GEM-based predictions can be validated against existing data, knowledge and bibliographical evidence. For instance, metabolites predicted to be secreted by the gut microbiota can be compared with the concentrations of metabolites found in fecal and blood samples [24,25]. Furthermore, blood metabolomics data can be used to validate metabolites predicted to be transported across the human gut. Meta-omics datasets [109] can be used to estimate the abundances of gut enzymes and microbial pathways for an individual species or strain [49]. These pathway abundances can be compared with the enrichment and usage (flux) of GEM-predicted pathway(s). GSMM can be applied to quantify the dietary nutrient uptake of gut microbes and their metabolic interactions with the host. To understand the regulation of host metabolism by gut microbes, germ-free (GF) and conventionally raised (CONV-R) mice are commonly used [110]. These mice can be raised on different diets and then euthanized, with samples analyzed by meta-omics approaches; the generated datasets can be used for the contextualization and validation of GEMs. Furthermore, the theoretical growth rate of a microbe can be validated by culturing the species in a specific medium [25,26]. In addition, predicted metabolic interactions between microbes, the regulation of co-occurrence networks, and dietary cross-feeding can be validated by mono- and co-culture experiments [90].

#### **10. Concluding Remarks and Future Perspectives**

Integration of meta-omics datasets and genome-wide metabolic reconstructions provides a framework for interrogating and suggesting mechanistic workings of diet-microbe-host metabolic interactions. However, such integrative methods are still evolving and require extensive and robust experimental validation.

Profiling and culturing gut microbes at the strain level, under controlled conditions, remains challenging. Recently, an integrated approach involving targeted phenotypic culturing, WGS, phylogenetic analysis and computational modeling succeeded in culturing a substantial portion of bacteria previously declared 'unculturable' under laboratory conditions; this approach identified 137 bacterial species, including novel species isolated from pure cultures [111]. Furthermore, culturomics techniques are currently being used to fill this gap by isolating unknown or novel members of the gut community [111,112].

In studies of gut microbial communities, there is increasing interest in mechanistic approaches, in contrast to solely genome-centric approaches. Correspondingly, GSMM is widely used as a preferred computational method for studying gut microbial metabolism and its interactions with the host. Additionally, GEMs can be contextualized and personalized using longitudinal meta-omics datasets, providing snapshots of metabolic processes over time. Personalized microbiota models may help to reduce the costs of clinical studies, predict markers and contribute to the development of potential treatments at either the individual patient level or for a defined patient group [83,113]. Many ongoing efforts aim to couple pharmacokinetic and constraint-based models to study drug-microbe-diet interactions [114]. However, a limitation of the GSMM approach is that GEMs are stoichiometric models and cannot, in their current form at least, incorporate metabolite concentrations or enzyme kinetics (Vmax, Km, Kcat) [115,116]. Albeit more limited in scope, kinetic modeling [116] may help improve understanding of the dynamics of metabolic pathways in the human gut.

As indicated in this review, GSMM and CBA have provided computational tools and frameworks for studying the metabolism of the gut microbiota. These tools have guided researchers in studying and identifying the metabolic functions of individual microbes in the gut community, and have helped to infer their spatial dynamics, environmental interactions and metabolic resource allocation under given conditions. We believe that a combination of several computational and experimental approaches may reveal the complex and diverse structure of the human gut microbiome and its underlying interactions with the host metabolic machinery. Such a combination might bridge the gaps in gut microbiome research and thereby enhance our knowledge of the human gut microbiota under health and disease conditions.

**Author Contributions:** P.S. drafted the manuscript. M.O. provided critical comments and edited the manuscript. Both authors approved the final version of the manuscript.

**Funding:** This work was supported by the Academy of Finland (Centre of Excellence in Molecular Systems Immunology and Physiology Research 2012–2017, Decision No. 250114, to M.O.) and the Juvenile Diabetes Research Foundation (2-SRA-2014-159-Q-R to M.O.) and the European Union Horizon 2020 project 'Elucidating Pathways of Steatohepatitis – EpoS' (Grant Agreement 634413).

**Acknowledgments:** We thank Alex Dickens and Santosh Lamichhane for helpful scientific discussions. We also thank Aidan McGlinchey for editing the manuscript.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Abbreviations**



#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Protocol* **"Notame": Workflow for Non-Targeted LC–MS Metabolic Profiling**

**Anton Klåvus 1,\*,**†**, Marietta Kokla 1,\*,**†**, Stefania Noerman 1, Ville M. Koistinen 1, Marjo Tuomainen 1, Iman Zarei 1, Topi Meuronen 1, Merja R. Häkkinen 2, Soile Rummukainen 2, Ambrin Farizah Babu 1, Taisa Sallinen 1,2, Olli Kärkkäinen 2, Jussi Paananen 3, David Broadhurst 4, Carl Brunius 5,6 and Kati Hanhineva 1,5,7,\***


Received: 2 March 2020; Accepted: 28 March 2020; Published: 31 March 2020

**Abstract:** Metabolomics analysis generates vast arrays of data, necessitating comprehensive workflows involving expertise in analytics, biochemistry and bioinformatics in order to provide coherent and high-quality data that enable discovery of robust and biologically significant metabolic findings. In this protocol article, we introduce notame, an analytical workflow for non-targeted metabolic profiling approaches utilizing liquid chromatography–mass spectrometry analysis. We provide an overview of lab protocols and statistical methods that we commonly practice for the analysis of nutritional metabolomics data. The paper is divided into three main sections: the first and second sections introduce the background and the study designs available for metabolomics research, and the third section describes in detail the steps of the main methods and protocols used to produce, preprocess and statistically analyze metabolomics data and, finally, to identify and interpret the compounds that have emerged as interesting.

**Keywords:** metabolomics; LC–MS; mass spectrometry; metabolic profiling; computational statistics; unsupervised learning; supervised learning; pathway analysis

#### **1. Introduction**

The rapid technical development of instrumentation for biomolecule analysis has led to a wide application of metabolomics in biological and biomedical research. Due to its very high sensitivity and the ability to concomitantly assess thousands of molecular features, liquid chromatography coupled with mass spectrometry (LC–MS) has become a key analytical tool in the field of discovery-driven metabolic profiling [1–3]. The LC–MS platform generates a vast number of signals, comprising biological signals from metabolites, their adducts, fragments and isotopes, as well as instrument noise, thereby necessitating adequate computational tools to process, analyze and interpret the data [4,5]. Although data processing solutions for complex metabolomics data are accumulating with increasing speed, they continue to be the bottleneck of the analysis, especially the metabolite identification process [6–8]. Starting from the acquisition of data to the identification of metabolites, the metabolic profiling workflow involves numerous steps that require expertise in analytical chemistry, biochemistry, bioinformatics and data analysis; click-and-go online tools may therefore not provide adequate reliability. To guarantee high-quality output from metabolomics experiments, cooperation of scientists with various backgrounds and expertise is needed.

First, the production of high-quality metabolomics data requires high-quality samples originating from studies with meaningful research questions, adequate sample preparation and know-how in operating MS instruments in order to obtain maximum performance from these sensitive measurements. The acquired data need to undergo several preprocessing steps, starting from data collection (peak picking), where it is imperative to understand the detection threshold and signal-to-noise ratios of the measurement. This is followed by a multi-step processing phase involving imputation, normalization, data reduction and clean-up, which determines the quality of the data used in downstream data analysis, metabolite identification and biological interpretation of the results. All of these steps need to follow the necessary quality assurance and quality control procedures for a reliable outcome of the metabolomics analysis [9,10]. Finally, the compounds that have emerged as interesting in the given study setup need to be identified using a combination of automated metabolite identification algorithms and exploration of the raw LC–MS/MS spectral data.

Although the non-targeted metabolic profiling workflow proposed here is applicable to essentially any metabolomics study, it has been developed and utilized mainly in food and nutrition research. Therefore, the examples of result presentation provided here are from studies within that field. In fact, food and nutrition sciences encompass a versatile array of research fields, which have adopted metabolomics as one of their most important analytical tools during the past decade [9]. For example, metabolic profiling allows a comprehensive analysis of the chemical composition of food and an estimation of the impact of industrial processing and of modifications by the gut microbiota [11,12]. Likewise, when assessing the actual health outcomes of certain diets or specific foods, metabolic profiling enables pinpointing the areas of metabolism that reflect the dietary differences; especially when the data are correlated with other, traditional clinical variables, they may raise novel hypotheses on the molecular-level linkage between diet and health [13–15].

Here, we present analytical workflows suitable for any non-targeted metabolic profiling study in a systematic manner (Figure 1), with a major focus on data-analysis challenges. We also present a new R package: notame (version 0.0.1, https://github.com/antonvsdata/notame), where we have bundled many of the data-analysis tools used in our lab so that they are easy to adopt for other scientists working in the field of metabolic profiling. This includes the pre-processing steps and visualizations in Section 3.2.2, Section 3.2.3, Section 3.2.4, Section 3.2.5, statistical tests and multivariate models in Section 3.3, as well as the visualizations in Section 3.4. The package documentation contains extensive instructions for using the package, along with a template script for preprocessing and analyzing data from a single-batch LC–MS experiment as well as a small example dataset.

**Figure 1.** A general overview of the notame workflow, comprising four main stages: 1. experimental design and sample collection; 2. sample preparation; 3. data acquisition; 4. data analysis and biomarker identification.

#### **2. Experimental Design**

The non-targeted metabolic profiling analytical workflow presented here includes steps from sample preparation and LC–MS analysis all the way to metabolite identification (Figure 1). It is noteworthy, however, that the study design and careful planning of the sampling are a very important part of the study, governing the quality of the results, and therefore require special attention [9]. Herein, we focus on metabolomics analyses performed in a single batch (where the number of samples typically reaches 200–300). However, the procedures are in general applicable to larger, multi-batch experiments, although extra quality control procedures are then in order [10,16].

#### *2.1. Materials*

Sample preparation materials:


LC–MS materials:


Reagents:


#### *2.2. Equipment*

The current workflow is demonstrated with one suitable LC–MS instrumentation and software combination, but any other high-accuracy LC–MS setup can likewise be employed.

Sample preparation and LC–MS instruments:


Software:


#### **3. Analytical Procedure and Results**

#### *3.1. LC–MS Analysis*

#### 3.1.1. Sample Preparation

Sample preparation for the non-targeted metabolite profiling work aimed to extract, in a single attempt, as wide a range of metabolites as possible with, in general, minimal sample workup. Therefore, straightforward, simple extraction protocols were preferred. Protocol 1 was designed for extracting plasma/serum samples at a ratio of 1:5 with ACN and Protocol 2 for extracting homogenized tissue samples at a ratio of 1:6 with 80% methanol.

**Protocol 1:** Plasma/Serum Samples


#### **Protocol 2:** Tissue Samples


#### 3.1.2. LC–MS Measurement

The most commonly applied analytical technique in non-targeted metabolic profiling is mass spectrometry, often combined with liquid or gas chromatographic separation at the front end. In order to cover a wide range of polarities among the analyzable metabolites, different chromatographic methods may be utilized, e.g., reversed-phase chromatography (RP) and hydrophilic interaction chromatography (HILIC). MS data can then be acquired in both positive (+) and negative (−) electrospray ionization (ESI) polarities.


#### *3.2. Data Collection and Preprocessing*

The data collection (peak picking) and subsequent preprocessing of the raw data are critical steps in non-targeted metabolomics data analysis, since they determine the quality of the data for all the remaining steps (Figure 2). Various peak picking algorithms exist, utilized by vendor-specific and open-source software as well as freely available online services. Widely used examples of open-source software include XCMS (and XCMS Online), MZmine and MS-DIAL. In this workflow, MS-DIAL (http://prime.psc.riken.jp/Metabolomics\_Software/MS-DIAL/) [17] is used for peak picking; it has a user-friendly interface and contains advanced tools for signal filtering, metabolite annotation, chromatogram curation and visualization. After collection of the raw data, pre-processing is required to monitor the quality of the data, to make any required transformations/corrections, and to reduce/merge the number of features originating from the same metabolite.

**Figure 2.** Workflow of the statistical analysis after the peak-picking step. The choices depend on the type of data, the research question and the study design. The tools used for specific steps are listed on the right side of the respective steps. Italicized names are names of external R packages, names ending with () are major functions from the notame package. For more details, see the package documentation.


#### 3.2.2. Drift Correction and Flagging Low-Quality Features

LC–MS-based metabolomics suffers from systematic intensity drift during an LC–MS run: the signal intensity of a molecular feature either decreases or increases systematically throughout the experiment. Removing this drift increases the quality of LC–MS data and allows the true biological effects to be estimated more accurately. Unfortunately, some molecular features show too much variation in the QC intensities even after drift correction. We use several quality metrics defined by Broadhurst et al. [10] for measuring the quality of a molecular feature before and after drift correction. Low-quality features are flagged and not included in downstream data analysis. Note that we do not recommend removing low-quality features completely, as they are sometimes needed in the metabolite identification phase when searching for specific ions or fragments of known molecules.


overall curvature of the drift function. The smoothing prevents the spline from overfitting the drift function in the presence of a few deviating QC samples (see Figure 4). A suitable value for the smoothing parameter is chosen by leave-one-out cross-validation. For the R function smooth.spline [24], we recommend a smoothing parameter between 0.5 and 1.5.

**Figure 3.** A molecular feature before (**a**) and after (**b**) drift correction by smoothed cubic spline regression. The horizontal lines represent 2 standard deviations from the mean of quality control (QC) samples and biological samples, respectively. The systematic effect of the drift is reduced upon correction.

39. Correct the abundance of each sample using the following formula (for a sample with injection order *i*):

$$\mathbf{x}_{\text{corrected}}(i) = \mathbf{x}_{\text{original}}(i) + \mathrm{mean}(\mathbf{x}_{\text{QC}}) - f_{\text{drift}}(i) \tag{1}$$
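As a minimal sketch of Equation (1) for a single molecular feature (not the notame implementation itself), assuming hypothetical vectors `intensity` and `injection_order` and a logical QC flag `is_qc`:

```r
# Sketch of Equation (1): smoothed cubic spline drift correction for one feature.
drift_correct_feature <- function(intensity, injection_order, is_qc) {
  # Fit a smoothed cubic spline to the QC intensities as a function of injection
  # order; cv = TRUE selects the smoothing by leave-one-out cross-validation
  # (the workflow additionally constrains the smoothing parameter to roughly 0.5-1.5).
  fit <- smooth.spline(x = injection_order[is_qc], y = intensity[is_qc], cv = TRUE)
  f_drift <- predict(fit, x = injection_order)$y
  # Equation (1): shift each observation by the deviation of the drift function
  # from the mean QC intensity.
  intensity + mean(intensity[is_qc]) - f_drift
}
```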


**Figure 4.** A molecular feature in the presence of an outlying quality control (QC) sample (circled) before (**a**) and after (**b**) drift correction by smoothed cubic spline regression. The horizontal lines represent 2 standard deviations from the mean of QC samples and biological samples, respectively. Due to the smoothing, the correction method is robust against the deviating QC sample and adjusts seemingly adequately for the global drift trend.

#### 3.2.3. Quality Control

The raw data obtained from the peak picking software require careful examination to estimate the need for additional preprocessing, such as drift correction (see Section 3.2.2). In the proposed workflow, the data quality is monitored at each step of the preprocessing with a set of visualizations. The example figures are based on RP positive mode data from a dietary intervention study [25], before and after drift correction and removal of low-quality features. All the visualizations described in this section are available in notame (see the visualizations vignette for details).


**Figure 5.** The six histograms illustrate *p*-values from linear regression models between each feature and injection order. The dashed red lines represent the uniform distribution. Histograms a.1 and a.2 show the *p*-values before (**a.1**) and after (**a.2**) drift correction in all samples. Histograms b.1 and b.2 focus only on the biological samples before (**b.1**) and after (**b.2**) drift correction. Finally, histograms c.1 and c.2 show only the *p*-values from the quality control (QC) samples before and after drift correction. In this case, there is a strong drift in the LC–MS data, because the *p*-values of the QCs (**c.1**) gather close to zero. After drift correction (**c.2**), the *p*-values for the QCs increase.


**Figure 6.** Boxplots of feature intensities per sample. The boxplots where the samples are ordered by study group (**a.1**) and by injection order with quality control (QC) samples colored distinctly (**b.1**) show a clear systematic decrease in signal intensity during the injection sequence. After drift correction, the drift pattern is no longer observable (**a.2** and **b.2**).

**Figure 7.** The density plot (**a**) shows a clear overlap between the distribution of quality control (QC) samples and the biological samples, which indicates poor data quality. After drift correction and quality control (**b**), the distributions are no longer overlapping.

Principal component analysis (PCA) [27–29] or t-distributed stochastic neighbor embedding (t-SNE) [30] can be used for observing patterns in the data by drawing scatter plots of the samples in a low-dimensional space (Figures 8 and 9). PCA is a linear method, while t-SNE can also reveal non-linear patterns. Unlike t-SNE, PCA offers information on loadings, i.e., on how the principal components are constructed from original features. For these reasons, we consider PCA and t-SNE as complementary methods. For conciseness we only show t-SNE figures here.
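As a brief sketch (assuming a hypothetical samples x features matrix `peak_table` after drift correction and imputation; the Rtsne package is used here for t-SNE):

```r
library(Rtsne)

# PCA: linear projection, offers loadings for interpretation.
pca <- prcomp(peak_table, center = TRUE, scale. = TRUE)
plot(pca$x[, 1:2], main = "PCA scores")

# t-SNE: non-linear embedding, complementary to PCA (no loadings available).
set.seed(38)                                        # t-SNE is stochastic
tsne <- Rtsne(peak_table, perplexity = 30, check_duplicates = FALSE)
plot(tsne$Y, xlab = "t-SNE 1", ylab = "t-SNE 2", main = "t-SNE embedding")
```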

**Figure 8.** Investigating drift correction patterns using the t-SNE method. The quality control (QC) samples shift systematically before drift correction (the linear trend of the purple crosses) (**a**), whereas after drift correction (**b**), the trend is gone and the QCs group nicely.

**Figure 9.** The drift pattern in the injection order (the color trend) using the t-distributed stochastic neighbor embedding (t-SNE) method is visible before drift correction (**a**), whereas after drift correction (**b**), the samples are more randomly scattered.


**Figure 10.** The hexbin plots show similar patterns as the scatterplots in Figure 9: The drift pattern in the injection order (the color trend) using the t-distributed stochastic neighbor embedding (t-SNE) method is visible before drift correction (**a**), whereas after drift correction (**b**), the samples are more randomly scattered. The color of each hexagon corresponds to the mean injection order of the data points in that hexagon.

52. Apply hierarchical clustering [31,32] to the samples and visualize the result in a dendrogram (Figure 11a,b). The QC samples should cluster together early. We also draw a heatmap (Figure 11c,d) representing pairwise distances between samples, where samples on the x and y axes are ordered by hierarchical clustering. The QC samples should have smaller inter-sample distances than other samples. Several techniques can be used for clustering. However, we have consistently achieved good results with hierarchical clustering using Euclidean distances and Ward's criterion for linking clusters [32].
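A minimal sketch of this step in base R, under the stated choices (Euclidean distance, Ward's criterion) and again assuming a hypothetical `peak_table` matrix of samples by features:

```r
# Step 52 sketch: cluster samples and inspect QC behaviour.
d  <- dist(peak_table, method = "euclidean")
hc <- hclust(d, method = "ward.D2")      # Ward's criterion for linking clusters
plot(hc)                                 # dendrogram; QC samples should cluster together early

# Heatmap of pairwise distances, rows and columns ordered by the clustering.
heatmap(as.matrix(d), Rowv = as.dendrogram(hc), Colv = as.dendrogram(hc),
        scale = "none")
```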

**Figure 11.** The hierarchical clustering algorithm clusters quality control (QC) samples together even before drift correction (**a**) whereas, after performing drift correction (**b**), the QC samples cluster more clearly together. In the heatmap after the drift correction (**d**) a QC "block" pattern (purple color code), is more clearly visible than in the heatmap before drift correction (**c**).

#### 3.2.4. Imputation, Transformation, Normalization and Scaling

Missing data occur in metabolomics datasets for various reasons, and managing this missingness is highly challenging [33]. Imputation is the procedure of replacing missing data with reasonable values using a priori knowledge or information available from the existing data. In this workflow, we perform random forest (RF)-based imputation using the missForest package [33,34], although several other procedures are available [35,36]. Data distributions can affect statistical analysis, especially for variance-based models [37]. Consequently, transformation and normalization can be used to adjust for data heteroscedasticity and skewed distributions among the molecular features. Depending on the type of multivariate analysis chosen, we proceed with different normalization and transformation approaches [38]; however, in the case of feature-wise univariate analysis (Section 3.3.1), only imputation is performed. All the preprocessing methods mentioned here are provided in notame (see the preprocessing vignette for details).
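A minimal sketch of the imputation step, using the missForest package named above; the log transformation shown is just one possible choice, and `peak_table` is again a hypothetical samples x features matrix containing NAs:

```r
library(missForest)

set.seed(38)
# Random forest imputation of missing feature intensities
# (can be slow for large feature tables).
imputed <- missForest(as.data.frame(peak_table))$ximp

# One possible transformation to reduce skewness before variance-based models,
# assuming strictly positive abundances.
log_table <- log2(imputed)
```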


#### 3.2.5. Clustering Molecular Features Originating from the Same Metabolite

Current peak picking software can detect isotopes, the most common adducts and some in-source fragments, and combine those features into one entry in the data matrix. However, in LC–MS analysis, unpredictable adduct behavior and neutral loss formation occur frequently, resulting in the same metabolite being redundantly represented in the data matrix. This causes problems not only for the identification of compounds but also potentially in the data-analysis step, due to multicollinearity.

We present here a method for clustering and combining these features, developed bespoke to our workflow [41]. Partially similar methods to tackle this problem have also been published elsewhere [42–44]. Features originating from the same compound are assumed to be strongly correlated and to have a small difference in their retention time. Thus, the algorithm initially identifies pairs of correlated features within a specified retention time window. The user specifies both the correlation threshold and the size of the retention time window; for illustration, a correlation coefficient threshold of 0.9 and a retention time window of ±1 s are used. Spearman's rank correlation is used because, although features originating from the same compound are expected to covary strongly, the relationship is not necessarily linear, for example when some measured features are close to the lower or upper limit of quantification (LLOQ and ULOQ) of the instrument.

Next, an undirected graph of all the connections between the features is generated, where each node represents a feature and each edge represents the corresponding correlation coefficient under the retention time constraint (Figure 12a). The algorithm then recursively identifies clusters presumed to reflect the same analyte. In brief, this is achieved using a connectivity criterion, i.e., the features within a cluster should correlate strongly with a sufficient number of the other features within the cluster. A detailed explanation of the algorithm is beyond the scope of this paper; it is included in the Supplementary Materials (Section 1: Clustering features originating from the same compound) for more advanced (bio)computational scientists.

**Figure 12.** (**a**) An example graph, where every node is a molecular feature and every edge represents a high correlation coefficient and a small retention time difference between the features. (**b**) The graph after the clustering procedure. Each color corresponds to a distinct cluster of features.

After clustering, the feature with the largest median peak area per cluster is retained. All the features that are clustered together are recorded for future reference. Figure 12b shows the state of the graph from Figure 12a after clustering, with each final cluster colored differently.
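For illustration only, the core idea can be sketched in R with the igraph package; this simplified version takes connected components as clusters, whereas the notame implementation (the cluster\_features() function, see step 57 below) applies the additional connectivity criterion described above. The objects `features` (with an rt column, in minutes) and `intensities` (a samples x features matrix whose column order matches the rows of `features`) are hypothetical:

```r
library(igraph)

rt_window  <- 1 / 60     # +/- 1 s expressed in minutes
cor_thresh <- 0.9        # Spearman correlation threshold

# Pairwise retention time differences and Spearman correlations between features.
rt_diff <- abs(outer(features$rt, features$rt, "-"))
cors    <- cor(intensities, method = "spearman")

# Candidate pairs: strongly correlated features eluting within the RT window.
keep  <- which(rt_diff <= rt_window & cors >= cor_thresh & upper.tri(cors),
               arr.ind = TRUE)
edges <- cbind(colnames(intensities)[keep[, 1]], colnames(intensities)[keep[, 2]])

# Undirected graph of candidate pairs; connected components as candidate clusters.
g        <- graph_from_edgelist(edges, directed = FALSE)
clusters <- components(g)$membership
```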

57. Cluster the molecular features from each analytical mode separately using the algorithm described above. Represent each cluster with the feature with the highest median abundance. Use these features for multivariate analysis and the clustering information for metabolite identification. The algorithm is provided in notame through the cluster\_features() function.

#### *3.3. Data Analysis*

Once the raw data are checked for quality and analytical drift, and the features originating from the same metabolite are merged to reduce the data matrix, the next phase is to apply data-analytical methods to discover the metabolites of biological importance within the given study set-up. Preferably, a combination of feature-wise and multivariate analyses is applied (Figure 2). Notame provides an interface for all the statistical tools mentioned in this section (see the statistics vignette for details).

58. Combine the features from the different analytical modes into a single data matrix. In notame, this is achieved with the function merge\_metabosets, which simply concatenates the data matrices and feature information tables row-wise (each row corresponds to a feature) and preserves the sample information unchanged. Note that combining analytical modes inevitably results in increased redundancy in the data matrix, as many compounds are detected in multiple analytical modes. However, combining the modes is necessary so that all available information can be used by the multivariate analysis methods.

#### 3.3.1. Feature-Wise (Univariate) Analysis

In feature-wise analysis, two types of tests may be used depending on the data: parametric and non-parametric tests [45]. The choice of statistical test depends on the data and the biological questions of the study. Most typically, parametric tests are used, but if the features do not satisfy the assumptions of parametric tests, they may be replaced with non-parametric alternatives. Non-parametric methods perform better when dealing with non-normal populations, unequal variances and small, unequal sample sizes.
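As a minimal sketch of such feature-wise testing for a two-group design (assuming a hypothetical samples x features matrix `peak_table` and a two-level factor `group`; this is not the notame interface itself):

```r
# Parametric and non-parametric feature-wise tests for a two-group comparison.
p_param  <- apply(peak_table, 2, function(x) t.test(x ~ group)$p.value)
p_nonpar <- apply(peak_table, 2, function(x) wilcox.test(x ~ group)$p.value)

# Multiple-testing correction (FDR-adjusted q-values).
q_values <- p.adjust(p_param, method = "fdr")
```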


#### 3.3.2. Multivariate Analysis

There are several powerful multivariate tools for the analysis of metabolomics data. Dimensionality reduction methods like PCA or t-SNE enable us to explore the data to identify outliers and patterns among samples. Unsupervised clustering methods, such as hierarchical clustering, are useful for validating findings from dimensionality reduction methods, as they allow us to observe clustering patterns in high-dimensional space.

Supervised learning techniques, such as partial least squares (PLS) and random forest (RF), are useful for identifying the most interesting molecular features [50,51]. Both the PLS and RF algorithms can be used for regression and classification purposes. In the case of classification, the PLS model is normally referred to as partial least squares discriminant analysis (PLS-DA). Contrary to the unsupervised methods, supervised methods rely on a known outcome or response (e.g., class membership) for each sample and can be used for predictive and descriptive modeling as well as for discriminative variable selection. RF is highly flexible, with three main advantages over PLS: RF does not assume a Gaussian distribution of the variables; RF does not assume linear relationships between the response and the (latent) predictor variables; and RF is scale invariant, which circumvents issues with scaling and transformation of metabolomics data. On the other hand, it should be noted that PLS can produce stronger models if its assumptions are met. Both PLS and RF offer statistics for evaluating the importance of individual features, such as the variable importance in projection (VIP) values in PLS and the Gini index or mean increase in error in RF.
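To make the feature-ranking idea concrete, a plain random forest could be sketched in R as below; the MUVR procedure referred to in Figure 13 adds repeated cross-validation and recursive variable elimination on top of such a model. The objects `peak_table` (samples x features matrix) and `group` (a two-level factor) are assumptions of this sketch:

```r
library(randomForest)

set.seed(38)
rf  <- randomForest(x = peak_table, y = group, ntree = 500, importance = TRUE)
imp <- importance(rf, type = 1)   # permutation importance (mean increase in error)
head(imp[order(imp, decreasing = TRUE), , drop = FALSE])
```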


**Figure 13.** Modelling error, measured as the number of misclassifications during internal cross-validation in MUVR. The overall modelling error (black curve) initially decreases when removing noisy variables, up to the 'max' model. Further removal of variables down to the 'min' model removes redundant features while keeping the modelling error almost constant. The 'mid' model represents a compromise between the 'min' and 'max' models and a theoretical optimum. Light and dark grey lines represent a higher level of detail in the validation procedure; we refer to Shi et al. [50] for details.

#### 3.3.3. Ranking and Filtering for Variable Selection

After the completion of both feature-wise and multivariate analysis, results are combined via a ranking method in order to determine the most robust and presumably biologically relevant metabolic features to undergo identification.


71. **Optional Step**: If the MUVR package is not used for variable selection, the procedure for ranking the molecular features stays the same regardless of the type of variable selection chosen.

#### *3.4. Visualization of Results*

After feature-wise and multivariate analysis, we recommend visualization of patterns of the dataset, both on a feature level and a global level as well as visualization of the *p*-values and effect size measures, to offer a broad view of the results. All the visualizations in this section are provided in notame unless stated otherwise (see the visualizations vignette for details).

#### 3.4.1. Feature-Wise Graphs

While t-SNE figures (Figures 8 and 9) provide a solid overview of the overall patterns in the data, visualizing effects of study factors on a molecular feature level is useful when interpreting the results. The visualization type used depends on study design.

72. If the study has multiple study groups, the differences between groups can be illustrated by beeswarm boxplots separately for each group (Figure 14).

**Figure 14.** Beeswarm boxplots for a molecular feature subdivided into study group.

73. If the study contains samples from multiple time points, the effect of time can be visualized with a line plot using one line per subject together with a thicker line representing the mean at every time point (Figure 15).

If the study contains both multiple groups and multiple time points, consider the following visualizations:

For repeated measures data, plot least square means from the repeated measures model for each study group. You should also add whiskers around the points representing 95% confidence intervals, standard deviation or other measure of variability (Figure 16).

**Figure 15.** The change in the abundance of a molecular feature as a function of time in each subject. The thick red line represents the sample mean.

**Figure 16.** The change in the abundance of a molecular feature as a function of time in each study group. The whiskers depict 95% confidence intervals.

74. Draw a line plot similar to the one in step 73, but color the subject lines according to group and draw separate mean lines for each group (Figure 17a). If the figure gets too cluttered, consider plotting each group separately in small multiples, with a common y-axis (Figure 17b).

**Figure 17.** The change in abundance of a molecular feature between two time points in each subject, colored by group (**a**). Data with time series from multiple groups are easier to read when divided into small multiples (**b**). The bold lines represent group means. Note that the bold mean lines do not necessarily reflect an overall trend present in each subject.

#### 3.4.2. Comprehensive Visualization of Results

Here, we present ways of visualizing results from both feature-wise and multivariate analysis. For illustration, we use a simple case from the RP positive mode of an intervention study, where the samples were taken from two time points, before and after an intervention. For feature-wise analysis, we used a linear model with individual molecular feature as the dependent variable and the time point as the independent variable. We also calculated fold change between the two time points for a scale-free measure of effect size. For multivariate analysis we fit a PLS-DA model predicting the time point from the features.

75. Visualize the patterns in the final dataset using unsupervised dimensionality reduction techniques such as PCA [28] (Figure 18) and t-SNE. If the PCA score plot reveals interesting patterns, use a PCA loadings plot to reveal which features contribute the most to the visualized components.

**Figure 18.** Principal component analysis (PCA) plot of samples from an intervention study, before and after the intervention. The time points are somewhat separated, but no clear clusters or outliers are visible.

76. If PLS(-DA) is used, visualize the samples in a PLS score plot (see Figure 19).

**Figure 19.** Score plot of the first two components of a partial least squares-discriminant analysis (PLS-DA) model trained to predict the time point of samples from an intervention study. The background color indicates the prediction of the model: samples in the blue area are classified to time point "beginning" and samples in the red area to time point "end". Note that the time points are clearly more separated than in the corresponding principal component analysis (PCA) plot (Figure 18). This is to be expected, as PLS-DA finds components that specifically separate the two time points.

77. To visualize overall changes with respect to time in studies with multiple time points, use PCA and t-SNE figures with arrows depicting change in each individual. The arrows should start at the first time point and end at the last time point for each individual. We recommend plotting each study group separately, as the plot can get crowded since the arrows occupy significantly more space than points (Figure 20).

**Figure 20.** Changes in each subject between two time points visualized as arrows between points in a principal component analysis (PCA) plot. Samples in different groups are separated into subplots. While no group shows a systematic direction of change, we can observe that the subjects in group A show greater overall change than subjects in the other groups.

78. Visualize the distribution of *p*-values from feature-wise analysis in a histogram. Use a line to depict the expected uniform distribution (under the null hypothesis). If the distribution of the *p*-values deviates from the line, as in Figure 21, it can be argued that we are observing a real effect.

**Figure 21.** The distribution of *p*-values from linear models testing the difference in feature abundance between two time points. Since the distribution clearly deviates from the uniform distribution depicted by the red line, it can be argued that there is a true difference between the two time points.

79. Visualize the results of feature-wise tests in a volcano plot. Volcano plots are scatter plots with *p*-values on the y axis and effect size (such as fold change) on the x-axis. Add a horizontal line representing the significance threshold for FDR-adjusted q-values. To co-visualize multivariate results, the features can be colored by their relevance score in the multivariate prediction (Figure 22).

**Figure 22.** A volcano plot of *p*-values (negative log10 scale) from linear models testing the difference of feature abundances between two time points against fold changes between samples taken before and after a dietary intervention (log2 scale). The features are colored by variable importance in projection (VIP)-value from a partial least squares-discriminant analysis (PLS-DA) model trained to separate the two time points. We can observe that the features with the smallest *p*-values tend to have fold changes below 1, indicating that they are less abundant at the end of the intervention. Other metrics of effect size, like Cohen's d values, can also be used in volcano plots.
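A volcano plot of this kind could be sketched with ggplot2 as follows, assuming a hypothetical data frame `results` with columns p_value, fold_change and vip gathered from the feature-wise and PLS-DA analyses (column names are illustrative):

```r
library(ggplot2)

# Raw p-value corresponding to the FDR < 0.05 threshold (assumes at least one
# feature passes; otherwise the threshold line would simply be omitted).
p_cut <- max(results$p_value[p.adjust(results$p_value, method = "fdr") < 0.05])

ggplot(results, aes(x = log2(fold_change), y = -log10(p_value), colour = vip)) +
  geom_point() +
  geom_hline(yintercept = -log10(p_cut), linetype = "dashed") +
  labs(x = "log2 fold change", y = "-log10(p-value)", colour = "PLS-DA VIP")
```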

Manhattan plots are commonly used in genome-wide association studies (GWAS) to visualize the location of the most significant single nucleotide polymorphisms on the genome. Manhattan plots can be applied in metabolomics by using the mass-to-charge ratio or retention time on the x-axis. In addition, in cases where the direction of the effect can be determined, the y-axis can be multiplied by the sign of the effect to create so-called directed Manhattan plots. The Manhattan analogy is not lost, since the downward points represent the reflection of the skyline in the Hudson River. Note that Manhattan plots should always be drawn separately for each column and ionization mode, as the metabolite classes corresponding to certain *m*/*z* and retention time values depend on the column and ionization mode used.

80. Use a Manhattan plot to connect the results of statistics to biochemical properties of the molecular features. The Manhattan plot should have either retention time or mass-to-charge ratio as the x-axis and –log10(*p*-value) on the y-axis. For a directed Manhattan plot, multiply –log10(*p*-value) by the sign of the effect. The points in the Manhattan plot can be colored by the respective VIP value from PLS-DA or another similar metric. Similar to volcano plots, add a horizontal line to represent the significance threshold for FDR-adjusted q-values. Figure 23a,b show Manhattan plots with mass-to-charge ratio and retention time on the x-axis, respectively.
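Analogously, a directed Manhattan plot could be sketched as below, assuming the same hypothetical `results` data frame with additional columns mz and effect (e.g., the log fold change):

```r
library(ggplot2)

# Raw p-value corresponding to the FDR < 0.05 threshold, as in the volcano sketch.
p_cut <- max(results$p_value[p.adjust(results$p_value, method = "fdr") < 0.05])

ggplot(results, aes(x = mz, y = sign(effect) * -log10(p_value), colour = vip)) +
  geom_point() +
  geom_hline(yintercept = c(-1, 1) * -log10(p_cut), linetype = "dashed") +
  labs(x = "m/z", y = "signed -log10(p-value)", colour = "PLS-DA VIP")
```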

**Figure 23.** (**a**) A directed Manhattan plot of *p*-values from linear models testing the difference of feature abundances between two time points, with the mass-to-charge ratio of the features on the x-axis. The points are colored by variable importance in projection (VIP)-value from a partial least squares-discriminant analysis (PLS-DA) model trained to separate the two time points. The most interesting groups of molecular features seem to have *m*/*z* ratios around 350 and around 800. Both groups are predominantly lower at the end of the intervention. (**b**) A similar directed Manhattan plot, only with the retention time of the features on the x-axis. The most interesting groups of molecular features seem to have retention times around 9–10 min and around 11 min. The first group is predominantly lower at the end of the intervention, while the features in the second group have mixed associations.

81. To combine the information of both Manhattan plots, consider a scatter plot with *m*/*z* and retention time on the x- and y-axis, with the size of the point reflecting *p*-value and potentially colored by variable importance from multivariate modelling (e.g., VIP; Figure 24) or by effect size (e.g., fold change; not shown). While size is not an accurate metric in visualizations, this visualization combines mass and retention time so that the most interesting metabolite classes can be identified. As with Manhattan plots, these plots should be drawn separately for each column and ionization mode.

**Figure 24.** Scatter plot of molecular features in *m*/*z* vs retention time space, with the size of the points reflecting *p*-values from linear models testing the difference in feature abundances between two time points. The points are colored by variable importance in projection (VIP)-value from a partial least squares-discriminant analysis (PLS-DA) model trained to separate the two time points. To avoid too many overlapping points, only points with VIP value > 1 are drawn. We can observe that the most interesting group of features has retention times around 9–10 min and *m*/*z* ratios around 350.

We utilize Multiple Experiment Viewer (http://mev.tm4.org/) for *k*-means clustering and hierarchical clustering analyses, which group metabolites into separate clusters or into a hierarchy tree, respectively. Multiple Experiment Viewer is a useful option for post-hoc analysis as it requires no programming expertise. Readers familiar with programming can use other tools for similar results.

The heat maps produced from the analyses can be used to assess the impact of the intervention and the number and proportion of metabolites behaving in a certain manner (Figure 25). We also use the notame R package to produce heat maps of the identified metabolites and their associations with e.g., clinical markers, in which case additional information may be added to each cell, such as the statistical significance with circles, where a larger circle represents a lower *p*-value.

**Figure 25.** Heat map of all the 12,579 molecular features detected in reversed phase negative mode from cereal samples with some of the annotated compounds highlighted. *k*-Means clustering was applied to the dataset, dividing it into distinct clusters (*n* = 13) based on the relative abundance of the features across samples.

82. For the clustering in Multiple Experiment Viewer, first normalize the rows (signal abundances) and select appropriate color scale limits for the normalized abundances (0 to 10% of features can be off limits). For hierarchical clustering, choose whether to cluster only the features or samples as well. Use Pearson correlation and average linkage clustering. For k-means clustering, choose cluster genes, use Pearson correlation, calculate k-means and choose a low number of clusters (e.g., 4) for the initial run. Repeat the procedure by increasing the number of clusters until no more clusters with a unique pattern emerge and choose the highest number of clusters based on this visual optimization.

#### *3.5. Identification of Metabolites*

The identification and annotation of metabolites is a critical step in any metabolomics study to attribute biological meaning to the data analytical results and to enable further hypotheses to be developed for subsequent studies. In recent years, the development of new software and online tools, as well as the emergence and expansion of publicly available spectral databases of metabolites, have greatly facilitated the identification process [52,53]. Nevertheless, metabolite identification remains perhaps the most time-consuming task, where manual curation is necessary and where not all detected molecular features can be identified, leaving knowledge gaps in the interpretation of the results. Alongside the challenges related to instrumental differences and to matching the obtained MS/MS data to databases, a key bottleneck restricting the level and number of identifications is the lack of reference data for the vast number of metabolites produced by living organisms, estimated at up to one million for the plant kingdom [54] and more than 40,000 for humans [55]. Likewise, matching the obtained MS/MS data to existing databases is not straightforward due to differences in the experimental conditions used for collecting the reference data. Other limitations may be related to poor quality or lack of mass spectra from metabolites with low abundance in the sample.

We utilize MS-DIAL [18] in the initial semi-automated step of metabolite identification, where the experimental characteristics (exact *m*/*z*, retention time where applicable and MS/MS spectra at CID voltages of 10, 20 and 40 V) are compared with those in databases available in NIST MSP format. These databases include MassBank [53], MoNA [56] and others available from the RIKEN Center for Sustainable Resource Science website (http://prime.psc.riken.jp/Metabolomics\_Software/), combined in single files for the positive and negative ionization modes. Additionally, we have included our in-house spectral library in the MSP files. The semi-automated identification process annotates metabolites with a similarity score of 80% or above, after which the annotations are manually curated by assessing the similarity of the MS/MS spectra and the alternative annotations proposed by the software.

After the curation of the metabolites annotated by MS-DIAL, the remaining unknown metabolites undergo additional searches in databases that are primarily available online, including METLIN [52] for small metabolites and LIPID MAPS [57] for unknown metabolites with an RP retention time in the lipid region (>9 min). Additional attempts to characterize the unknowns are made utilizing MS-FINDER [18], which (1) calculates and scores the possible molecular formulas based on the exact mass and isotopic pattern, (2) searches for compounds corresponding to the likely molecular formulas in non-spectral chemical libraries and (3) compares the experimental MS/MS spectra of the unknowns with *in silico*-generated MS/MS spectra of the candidate structures.

3.5.1. Comparison with Pure Standard Compounds (MSI Level 1)

	- a. matching *m*/*z* (within 10 ppm or according to instrument mass accuracy);
	- b. similar retention time (ΔRT < 0.2–0.5 min), taking into consideration any possible near-eluting isomers;
	- c. matching MS/MS spectra (main fragments matching within 0.02 Da at one or more CID voltages).

3.5.2. MS/MS Fragmentation and Database Comparison (MSI levels 2–3)

MSI level 2 (match to MS/MS spectral databases):

	- a. matching *m*/*z* (within 10 ppm or according to instrument mass accuracy);
	- b. matching MS/MS spectra (main fragments matching within 0.02 Da).

MSI level 3 (putative characterization of the compound class):

	- a. compare the experimental MS/MS spectrum with *in silico*-generated spectra in MS-FINDER;
	- b. use the calculated molecular formula, retention time and diagnostic MS/MS fragments to determine the compound class (a minimal sketch of the tolerance checks above follows this list).
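As referenced above, the following is a minimal sketch of the numeric tolerance checks (ppm window on *m*/*z*, retention time difference, and fragment matching within 0.02 Da), assuming simple dictionaries for a detected feature and a reference standard; the example values are hypothetical, and the thresholds follow the criteria listed above.

```python
# Minimal sketch of the MSI level 1 tolerance checks: m/z agreement in ppm,
# retention time difference, and MS/MS fragment matching within 0.02 Da.
def ppm_error(observed_mz, reference_mz):
    return abs(observed_mz - reference_mz) / reference_mz * 1e6

def fragments_match(observed_peaks, reference_peaks, tol_da=0.02, min_matches=1):
    # Count observed fragments that fall within tol_da of any reference fragment.
    matches = sum(
        any(abs(obs_mz - ref_mz) <= tol_da for ref_mz, _ in reference_peaks)
        for obs_mz, _ in observed_peaks
    )
    return matches >= min_matches

def msi_level1_match(feature, standard, ppm_tol=10, rt_tol=0.5):
    return (ppm_error(feature["mz"], standard["mz"]) <= ppm_tol
            and abs(feature["rt"] - standard["rt"]) <= rt_tol
            and fragments_match(feature["msms"], standard["msms"]))

# Hypothetical detected feature and in-house standard (m/z, rt in min, MS/MS peaks).
feature  = {"mz": 166.0863, "rt": 2.41, "msms": [(120.0808, 100.0), (103.0542, 40.0)]}
standard = {"mz": 166.0863, "rt": 2.35, "msms": [(120.0808, 100.0), (103.0542, 35.0)]}
print(msi_level1_match(feature, standard))
```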

Once molecular features are annotated as metabolites, pathway analysis may be conducted to better understand the biological relevance of the metabolites, as well as their involvement in metabolic pathways, e.g., related to intervention effects or disease etiology [1,3]. We consider identification of metabolites to at least level 2 (putative annotation) to be essential prior to pathway analysis. Of the several pathway analysis tools that are freely available, we use predominantly MetaboAnalyst and Cytoscape. For both tools, conversion of metabolite names to HMDB or KEGG IDs, which are generally recognized by the pathway analysis software, is essential, since one molecule can have multiple names depending on the preference of each research group.
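A minimal sketch of this harmonization step is shown below, using a small hand-made lookup table; the IDs shown are illustrative and should be verified directly against HMDB and KEGG (or retrieved programmatically) before use.

```python
# Illustrative sketch of harmonizing metabolite names to database IDs before
# pathway analysis. The mapping table is a tiny hand-made example; verify the
# IDs against HMDB and KEGG directly before using them.
NAME_TO_IDS = {
    "d-glucose":   {"kegg": "C00031", "hmdb": "HMDB0000122"},
    "glucose":     {"kegg": "C00031", "hmdb": "HMDB0000122"},  # synonym
    "l-alanine":   {"kegg": "C00041", "hmdb": "HMDB0000161"},
    "citric acid": {"kegg": "C00158", "hmdb": "HMDB0000094"},
}

def to_database_ids(names, key="kegg"):
    ids, unmapped = [], []
    for name in names:
        entry = NAME_TO_IDS.get(name.strip().lower())
        (ids if entry else unmapped).append(entry[key] if entry else name)
    return ids, unmapped

ids, unmapped = to_database_ids(["Glucose", "L-Alanine", "Unknown feature 42"])
print(ids)       # IDs ready for MetaboAnalyst / Cytoscape input
print(unmapped)  # names that still need manual curation
```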


Step-by-step instructions for using the software are provided in the Supplementary Materials (Section 2: Tutorial on Pathway Analyses Tools). It is worth mentioning that pathway analysis may not be helpful for lipids, due to (i) the limited ability of the non-targeted LC–MS metabolomics platform to differentiate the position of the double bonds within a lipid molecule, which impairs the translation of lipid identity into a KEGG or HMDB ID, and (ii) the fact that most pathway analysis tools group lipid classes that vary greatly in their fatty acid composition into one node, which may not be biologically meaningful. As an example, phosphatidylcholines with different acyl compositions will be grouped into a single phosphatidylcholine node, which may not accurately represent acyl transfer in vivo. This gap emphasizes the need for pathway analysis tools specialized for lipid molecules.

#### *3.6. Biological Interpretation of the Results*

The analytical procedure described above aims to identify metabolites and metabolic pathways that are affected in the chosen study design, e.g., differences in circulating metabolites after dietary or other interventions, or processing-induced alterations to the phytochemical composition of a certain food. While the described workflow is efficient in elucidating such metabolites, the ultimate value lies in the demonstration of biological significance. The findings need to be related to the scientific context and interpreted in the light of existing biological knowledge. Optimally, findings can be validated, e.g., in subsequent studies, where the most interesting or important metabolite species may be chosen for additional analysis, often encompassing the development of targeted, quantitative analytical approaches and analysis in different study populations. An example of such an approach is the recent discovery of various trimethylated compounds related to whole grain consumption [62] and the establishment of a quantitative method within another cohort [63].

#### **4. Conclusions**

Non-targeted metabolic profiling employing liquid chromatography–mass spectrometry has proven its usefulness in various fields of natural and medical sciences during the last couple of decades and has greatly improved our capabilities to explore and understand the chemical space in biological samples. The notame workflow encompasses all the essential steps in metabolic profiling studies, from the generation of samples to the interpretation of the results, and is intended to serve as a general guideline for setting up and executing metabolomics studies, as well as to support users with an in-house developed R package (notame, version 0.0.1, https://github.com/antonvsdata/notame).

**Supplementary Materials:** The following are available online at http://www.mdpi.com/2218-1989/10/4/135/s1, Section 1: Clustering features originating from the same compound, Section 2: Tutorial on Pathway Analyses Tools.

**Author Contributions:** M.K. wrote the chapters on imputation and multivariate analysis, and contributed to writing, review and editing of the manuscript. A.K. wrote the chapters on quality control, drift correction, the feature clustering algorithm, feature-wise analysis and visualizations, is the main author of the notame R package and the Wranglr web application, and reviewed and edited the manuscript. S.N., M.T. and T.M. wrote the chapter on study design. S.N. and I.Z. described pathway analysis. J.P. conceived the original idea of the statistical and visualization pipeline and contributed to the content of the manuscript. V.M.K. wrote the chapter on the identification of metabolites and prepared illustrations. A.F.B. and T.S. wrote the introduction. S.R. and M.R.H. wrote, reviewed and edited the section on LC–MS analysis. K.H. was mainly responsible for the structure of the workflow and supervision of the scientists, wrote the introduction, and participated in review and editing. O.K., J.P., D.B. and C.B. participated in writing, reviewing, editing and supervision. All authors read, reviewed and accepted the final manuscript.

**Funding:** Academy of Finland, Biocenter Finland, EU Horizon 2020, Finnish Cultural Foundation, Lantmännen Research Foundation.

**Acknowledgments:** Development of the R package and Wranglr as well as data analyses were carried out with the support of Bioinformatics Center, University of Eastern Finland, Finland. Miia Reponen is thanked for the preparation of samples and LC–MS operation.

**Conflicts of Interest:** The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses or interpretation of data; in the writing of the manuscript; or in the decision to publish the results. K.H., O.K., V.M.K., A.K. and J.P. are owners of a spin-off company providing metabolomics services, Afekta Technologies Ltd.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **CFM-ID 3.0: Significantly Improved ESI-MS/MS Prediction and Compound Identification**

#### **Yannick Djoumbou-Feunang 1,†, Allison Pon 2, Naama Karu 1,‡, Jiamin Zheng 1, Carin Li 1, David Arndt 1, Maheswor Gautam 1, Felicity Allen 3 and David S. Wishart 1,4,\***


Received: 4 March 2019; Accepted: 8 April 2019; Published: 13 April 2019

**Abstract:** Metabolite identification for untargeted metabolomics is often hampered by the lack of experimentally collected reference spectra from tandem mass spectrometry (MS/MS). To circumvent this problem, Competitive Fragmentation Modeling-ID (CFM-ID) was developed to accurately predict electrospray ionization-MS/MS (ESI-MS/MS) spectra from chemical structures and to aid in compound identification via MS/MS spectral matching. While earlier versions of CFM-ID performed very well, CFM-ID's performance for predicting the MS/MS spectra of certain classes of compounds, including many lipids, was quite poor. Furthermore, CFM-ID's compound identification capabilities were limited because it did not use experimentally available MS/MS spectra nor did it exploit metadata in its spectral matching algorithm. Here, we describe significant improvements to CFM-ID's performance and speed. These include (1) the implementation of a rule-based fragmentation approach for lipid MS/MS spectral prediction, which greatly improves the speed and accuracy of CFM-ID; (2) the inclusion of experimental MS/MS spectra and other metadata to enhance CFM-ID's compound identification abilities; (3) the development of new scoring functions that improve CFM-ID's accuracy by 21.1%; and (4) the implementation of a chemical classification algorithm that correctly classifies unknown chemicals (based on their MS/MS spectra) in >80% of the cases. This improved version, called CFM-ID 3.0, is freely available as a web server. Its source code is also accessible online.

**Keywords:** mass spectrometry; liquid chromatography; MS spectral prediction; metabolite identification; structure-based chemical classification; rule-based fragmentation; combinatorial fragmentation

#### **1. Introduction**

Liquid chromatography (LC) coupled to mass spectrometry (MS) or tandem mass spectrometry (MS/MS) has become one of the leading techniques for compound identification in organic chemistry, natural product chemistry, and metabolomics [1,2]. In the field of metabolomics, LC-MS/MS is widely used to identify and quantify individual chemicals in complex biological or environmental mixtures. For untargeted MS-based metabolomics, high performance or ultrahigh performance liquid chromatography (HPLC or UHPLC) is first performed to separate compounds in the sample and then electrospray ionization (ESI) mass spectrometry (MS and MS/MS) is used to collect the mass spectra of
each chromatographic peak. In order to identify individual compounds, the resulting MS/MS spectra, along with the chromatographic retention time and parent ion masses of the compound of interest, are then (ideally) compared to the MS/MS spectra and retention time of authentic standards to confirm the compound's identity.

Because of the limited availability of many authentic chemical standards in most metabolomics labs, putative metabolite identification is more commonly performed [3]. Putative identification (MSI level 2) is achieved by comparing the MS/MS spectra to experimentally collected reference spectra found in various MS/MS spectral databases. Key to the success of this putative identification process is the availability of a large, comprehensive database containing experimentally collected MS/MS spectra of pure compounds that covers a large portion of "chemical space". Unfortunately, publicly available databases of experimental MS/MS spectra currently cover a total of only ~20,000 unique compounds [4]. Consequently, as reported in many large-scale metabolomic studies [5,6], the percentage of MS spectral features that can be confidently assigned to known compounds is often less than 2%. As a result, the compound identification step continues to be the central bottleneck in almost all untargeted MS-based metabolomic studies.

Given the cost of synthesizing or acquiring the 100,000's of chemicals needed to create the required experimental MS/MS spectral libraries, a growing number of scientists are turning to in silico metabolomic methods to facilitate compound identification. Over the last decade, a number of computational MS approaches have been developed for this purpose. Some of the more popular software tools use MS/MS fragmentation trees and spectral fingerprints (e.g., CSI:FingerID [7]) of an observed ESI-MS/MS spectrum to rank the likelihood that a given chemical structure could produce such a spectrum. Other tools arrange substructures of a candidate molecule into a format that best explains the fragmentation pattern observed in a given experimental MSn spectral tree (MAGMA [8]). Still others, such as MetFrag [9], use in silico fragmentation of candidate molecules, based on a given mass spectrum and the mass of the precursor molecule, to identify its structure. Competitive Fragmentation Modeling-ID (CFM-ID) [10–12] uses in silico fragmentation techniques to predict ESI-MS/MS (for LC-MS) or EI-MS (for GC-MS) spectra for a given structure. By matching the observed MS/MS spectrum to a library of predicted MS/MS spectra, it is possible to identify or rank which compound is being observed. It is widely believed that increasing the number (and accuracy) of in silico-predicted spectra should increase the likelihood of successfully identifying compounds from newly acquired MS/MS spectra [13].

The two main in silico MS fragmentation techniques are rule-based approaches and combinatorial approaches. Rule-based "fragmenters" use hand-made rules based on experimentally observed fragmentation patterns that are specific to one or more structural features or chemical classes. These rules are typically extracted from analyzing the scientific literature or, preferably, learned from in-house experimental data. Mass Frontier™ (ThermoFisher, CA; HighChem, Bratislava, Slovakia) is an example of a software tool that uses hand-made fragmentation rules. Once the rules are implemented, this approach can be very fast, consistent, and accurate. However, a major disadvantage to this approach is that the design of fragmentation rules requires considerable expert curation. Furthermore, these rules cannot be applied to novel classes of molecules. For these reasons, much more emphasis has recently been put toward the implementation of computational combinatorial fragmentation approaches. Combinatorial fragmentation approaches iteratively cleave chemical bonds within a molecule in a combinatorial fashion, and use (or learn) penalty scores that favor the cleavage events that are most likely to occur at each step. Examples of tools that implement combinatorial fragmentation include CFM-ID [10–12], MetFrag [9], and FiD [14].
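To make the combinatorial idea concrete, the toy sketch below cleaves each acyclic single bond of a small molecule in turn (using RDKit) and reports the exact masses of the resulting fragments. It illustrates the general principle only, not CFM-ID's algorithm: there is no scoring of cleavage events, only single-bond cuts are made, and aspirin is used purely as an arbitrary example input.

```python
# Toy illustration of combinatorial fragmentation (not CFM-ID's algorithm):
# break each acyclic single bond of a small molecule in turn and list the
# exact masses of the resulting fragments. Requires RDKit.
from rdkit import Chem
from rdkit.Chem.Descriptors import ExactMolWt

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # acetylsalicylic acid, arbitrary example

breakable = [b.GetIdx() for b in mol.GetBonds()
             if b.GetBondType() == Chem.BondType.SINGLE and not b.IsInRing()]

for bond_idx in breakable:
    # FragmentOnBonds marks each cut site with a dummy atom ([*], mass 0).
    pieces = Chem.FragmentOnBonds(mol, [bond_idx])
    frags = Chem.GetMolFrags(pieces, asMols=True)
    masses = sorted(round(ExactMolWt(f), 4) for f in frags)
    print(f"bond {bond_idx}: fragment masses {masses}")
```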

CFM-ID is a publicly available software tool and web server that can be used for MS/MS spectral prediction, MS/MS spectrum peak assignment, as well as MS-based compound identification. It implements a technique known as Competitive Fragmentation Modeling (CFM). CFM is a probabilistic generative modeling method that uses a customized cost function to take into account the structural composition of a molecule to predict its electrospray ESI-MS/MS spectrum. The original version of
CFM-ID was used to generate a reference MS/MS spectral library of over 51,000 known compounds from the HMDB [15] and KEGG [16] databases at 3 different collision energies (10, 20, and 40 eV). This spectral library was used by CFM-ID (version 1.0 and version 2.0) to suggest candidate molecules that match input experimental MS/MS spectra. In 2015, the original version of CFM-ID was shown to outperform FingerID and an earlier version of MetFrag in various identification tasks from ESI-MS/MS spectra [11]. However, subsequent tests and studies that further assessed the performance of CFM-ID have shown that a number of improvements could be made to the program and its spectral database [11,12].

For instance, one well-known limitation of CFM-ID is its very slow and relatively poor performance for predicting MS/MS spectra of lipids and other large "segmented" or modular metabolites. This is primarily due to the length of the fatty acids or attached head-group segments, leading to a combinatorial explosion of the possible fragments at each step of the in silico fragmentation process. As demonstrated by Kind et al. [17], who developed LipidBlast, and Tsugawa et al. [18], who studied sphingolipid fragmentation, the use of structure-based fragmentation rules appears to be much better at handling lipids and other large segmented or modular molecules (such as carbohydrates) than combinatorial fragmentation. However, it is important to note that LipidBlast also has some limitations. For instance, it does not provide a well-defined set of fragmentation rules or algorithms that can be incorporated into other computational MS spectral prediction tools. Furthermore, while LipidBlast does provide *m*/*z* values and heuristically modeled static relative abundances for fragment ions, the annotation of fragment ion peaks is limited to formulas and does not include actual structures. Moreover, LipidBlast predicts consensus mass spectra, and does not distinguish between different collision energies. These are the kinds of output that are typically found with most in silico fragmenters, and these shortcomings have been addressed in this update to CFM-ID.

In addition to the incorporation of compound-specific fragmentation rules, it has also been shown that significant improvements in MS-based compound identification can be achieved by including metadata or other forms of external data in the spectral matching and scoring functions [9]. In particular, the inclusion of citation frequency (the number of times a given compound is mentioned in the literature), along with the incorporation of experimentally collected MS/MS spectra in the reference spectral database can often improve compound identification performance by a factor of 2 or more [19]. When taking into account the chemical similarity or the distribution of structural features or chemical classes (via ClassyFire [20]) among candidates, it is often possible to improve the performance even further [7]. Based on these and other developments in the field of in silico metabolomics and in silico mass spectrometry, we have implemented a number of modifications to CFM-ID. These modifications have helped in a number of important ways, including (1) achieving faster and more accurate prediction of MS/MS spectra for 21 classes of lipids, (2) enabling the expansion of CFM-ID's reference spectral library to include both experimental and predicted MS/MS spectra, (3) enhancing CFM-ID's ability to incorporate metadata and chemical similarity, (4) improving CFM-ID's compound identification rates, and (5) enhancing CFM-ID's ability to predict the structural class of compounds for query spectra that could not be matched in CFM-ID's spectral database. The improved version of CFM-ID is called CFM-ID 3.0. It is freely available as a web server at http://cfmid3.wishartlab.com. Its source code is accessible at https://sourceforge.net/p/cfm-id/wiki/Home (combinatorial fragmentation tool) and https://bitbucket.org/wishartlab/msrb-fragmenter/ (rule-based fragmentation tool).

#### **2. Results**

#### *2.1. Encoding Lipid Fragmentation Rules*

Our manual analysis of experimentally acquired lipid MS/MS spectra provided a basis for the generation of 344 unique fragmentation rules covering 21 lipid classes and 7 adducts, for a total of 50 combinations of lipid classes and adduct types. Each rule describes a chemical reaction that fragments the precursor molecule to generate a specific fragment. The structure and mass-to-charge ratio of the fragment can be easily computed based on the encoded pattern. For each lipid class, an ESI-MS/MS spectrum can be simulated by CFM-ID 3.0 at collision energies of 10, 20, and 40 eV. In general, almost all ESI-MS/MS spectra of lipids show similar fragmentation patterns with characteristic losses of the polar head group, and the acyl or alkyl chains, with relatively little fragmentation within the acyl or alkyl chains. For example, in choline-containing glycerophospholipids, the most commonly observed fragments include phosphocholine (C5H14NO4P+ ion; neutral mass = 184.0733 Da) and the 1,2-cyclic phosphate diester (C2H6O4P+ ion; neutral mass = 125.0003 Da). Figure 1 illustrates consensus fragmentation patterns for phosphatidylcholines from their [M+H]<sup>+</sup> precursor ions. The number of rules for each lipid class and the number of covered adduct types per lipid class are shown in Table 1. These fragmentation rules are also available at https://bitbucket.org/wishartlab/msrb-fragmenter/.

**Figure 1.** Fragmentation patterns of phosphatidylcholines obtained from their [M+H]<sup>+</sup> precursor ions. Among all resulting fragments, only the precursor ion is observed at each of the three energy levels. The ion fragment C5H14NO4P+ (red arrow) corresponding to phosphocholine is observed at 20 and 40 eV, and the remaining fragments were observed only at 40 eV.


**Table 1.** Number of fragmentation rules and adduct types covered for each chemical category.

#### *2.2. The New CFM-ID 3.0 Spectral Library*

The original CFM-ID 2.0 spectral library contained 102,153 unique computationally generated ESI-MS/MS spectra (from 51,635 compounds). Because of improvements in the spectral prediction performance, additions of new compounds, and the decision to add experimental spectra, the new CFM-ID 3.0 spectral library has been expanded by a factor of 2.6 over the original CFM-ID 2.0 spectral library (as of February 2019). In particular, the new library now contains a total of 167,547 computationally generated ESI-MS/MS spectra (generated via CFM-ID 3.0) from 108,972 compounds in the HMDB [15]; 22,914 computationally generated ESI-MS/MS spectra from 11,685 compounds in KEGG [16]; and 83,049 experimentally collected ESI-MS/MS spectra from 21,904 compounds. The compounds in CFM-ID 3.0's experimental MS/MS spectral library are structurally and functionally diverse, and originate from various databases/libraries including HMDB (human metabolites) [15], DrugBank (drugs and drug metabolites) [21], KEGG (metabolites and drugs) [16], PhytoHub (dietary phytochemicals and their metabolites) [22], GNPS (natural products) [23], and MoNA [24]. In addition, 568 spectra from the CASMI 2014 [25] and CASMI 2016 [19] challenges were imported into the database. Moreover, 3953 spectra that were experimentally acquired at the Metabolomics Innovation Center (TMIC, Edmonton, AB, Canada) were also added. Each of the 229,084 compounds in the new spectral library was assigned a citation score (described below) that is used as metadata in compound identification tasks. Each compound in the spectral library has two or more citations. A summary of the library's statistics is displayed in Table 2.


**Table 2.** Statistics for the Competitive Fragmentation Modeling-ID (CFM-ID) 3.0 spectral database.

In our effort to improve the identification rates, a full chemical classification was computed for all 229,084 unique compounds using the computational chemical classifier called ClassyFire [20]. An average of ~25 chemical categories were assigned per compound. The chemical classification was used to adjust CFM-ID's original scoring system, to take into account the chemical composition and chemical similarity among candidate molecules. This compound classification process also served as a basis to predict the chemical class of any new compound corresponding to the query spectrum in identification tasks.

#### *2.3. Performance Testing*

#### 2.3.1. Lipid ESI-MS/MS Spectral Prediction

Two tests were performed to assess CFM-ID 3.0's lipid spectral prediction performance. One was for speed while the other was for accuracy. In terms of speed, CFM-ID 3.0 averaged 0.395 ± 0.03 s of computation time to predict each of the 120 lipid ESI-MS/MS spectra in the test set, while CFM-ID 2.0 averaged 68.58 ± 0.21 s for the same task. This represents a speed-up of 173.6X. Clearly, the rule-based approach for lipid analysis used in CFM-ID 3.0 is significantly faster than the combinatorial approach used in CFM-ID 2.0. For most other kinds of molecules, the average processing time for CFM-ID (3.0 and 2.0) is about 23.75 ± 0.2 s. Clearly, the computational slow-down for lipid spectral calculation (due to the many potential fragmentation combinations) is quite significant, which largely motivated us to develop a faster rule-based approach.

In terms of spectral prediction performance, the average spectral similarity score (measured using the dot product) between the experimental lipid ESI-MS/MS spectra (collected on a QTOF at multiple collision energies) and the CFM-ID 3.0-predicted ESI-MS/MS spectra was 0.85 ± 0.2. On the other hand, the average spectral similarity score between the CFM-ID 2.0-predicted ESI-MS/MS spectra and the experimental ESI-MS/MS spectra was 0.09 ± 0.1. This suggests that the accuracy of CFM-ID 3.0 for lipid spectral prediction is 11X better than that of CFM-ID 2.0, which is highly significant. It is worth mentioning that CFM-ID predicts ESI-MS/MS spectra at three different collision energies while other programs, such as LipidBlast, generate a consensus MS/MS spectrum that essentially merges the MS/MS spectra over all collision energies. Therefore, during our comparative analysis, only one LipidBlast-generated consensus ESI-MS/MS spectrum was used for each unique compound and compared against the experimental spectrum, independent of the energy level. The average spectral similarity score between the LipidBlast-predicted ESI-MS/MS spectra and the experimental ESI-MS/MS spectra was 0.34 ± 0.4. Figure 2 shows head-to-tail plots comparing the experimental ESI-MS/MS spectrum of dipalmitoyl phosphatidylcholine (PC(16:0/16:0)) collected at a 40 eV collision energy with the corresponding in silico spectra predicted with CFM-ID 2.0 [11] (Figure 2a), CFM-ID 3.0 (Figure 2b), and LipidBlast [17] (Figure 2c), respectively. The experimental spectrum was measured in positive ion mode ([M+H]+), with a collision energy of 40 eV.
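For reference, the dot-product spectral similarity used in such comparisons can be computed along the lines of the sketch below, which aligns peaks within a small *m*/*z* tolerance before taking the normalized dot product. This is a generic implementation with made-up example peaks, not CFM-ID's exact scoring function.

```python
# A generic dot-product (cosine) similarity between two MS/MS spectra,
# after greedy alignment of peaks within a small m/z tolerance.
import numpy as np

def dot_product_similarity(spec_a, spec_b, tol_da=0.01):
    mz_a, int_a = np.asarray(spec_a).T
    mz_b, int_b = np.asarray(spec_b).T
    aligned_a, aligned_b, used = [], [], set()
    for mz, inten in zip(mz_a, int_a):
        diffs = np.abs(mz_b - mz)
        j = int(np.argmin(diffs))
        if diffs[j] <= tol_da and j not in used:
            aligned_a.append(inten); aligned_b.append(int_b[j]); used.add(j)
        else:
            aligned_a.append(inten); aligned_b.append(0.0)
    for j, inten in enumerate(int_b):
        if j not in used:                     # unmatched peaks of spectrum B
            aligned_a.append(0.0); aligned_b.append(inten)
    a, b = np.array(aligned_a), np.array(aligned_b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up example peak lists (m/z, relative intensity).
experimental = [(184.0733, 100.0), (104.1070, 20.0), (125.0003, 8.0)]
predicted    = [(184.0733, 95.0), (104.1075, 25.0), (86.0964, 10.0)]
print(round(dot_product_similarity(experimental, predicted), 3))
```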


**Figure 2.** Head-to-tail plot of experimental and predicted electrospray ionization-tandem mass spectrometry (ESI-MS/MS) spectra of PC(16:0/16:0). (**a**) Head-to-tail plot showing an experimental ESI-MS/MS spectrum of dipalmitoyl phosphatidylcholine (PC(16:0/16:0)) measured at 40 eV, and the matching ESI-MS/MS spectrum predicted by CFM-ID 2.0. The computed spectral similarity score is 0.07. (**b**) Head-to-tail plot showing an experimental ESI-MS/MS spectrum of dipalmitoyl phosphatidylcholine measured in positive ion mode ([M+H]+) at 40 eV, and the matching ESI-MS/MS spectrum predicted by CFM-ID 3.0. The computed spectral similarity score is 0.88. (**c**) Head-to-tail plot showing an experimental ESI-MS/MS spectrum of dipalmitoyl phosphatidylcholine measured in positive ion mode ([M+H]+) at 40 eV, and the matching ESI-MS/MS spectrum predicted by LipidBlast. The computed spectral similarity score is 0.13.

As seen in Figure 2, the spectral similarity between the CFM-ID 2.0-generated spectrum and the experimental ESI-MS/MS spectrum was 0.07, with CFM-ID 2.0 being able to predict only two fragments that were observed in the experimental spectrum (namely, the C5H12N+ and C5H14NO4P+ ion fragments). For this particular example, CFM-ID 2.0 predicted 31 fragments (Figure 2a) while CFM-ID 3.0 predicted 10 fragments (Figure 2b), 7 of which were observed in the experimental ESI-MS/MS spectrum. It is worth noting that the remaining three fragments result from fragmentations that were observed in experimentally measured ESI-MS/MS spectra of phosphatidylcholines obtained for [M+H]<sup>+</sup> adducts at 40 eV. For this example, the spectral similarity score was 0.88 when comparing the experimental ESI-MS/MS spectrum with the CFM-ID 3.0-predicted spectrum. Surprisingly, the dot product score was only 0.13 when the LipidBlast-predicted ESI-MS/MS spectrum was compared to the experimental ESI-MS/MS spectrum. Figure 3 shows comparisons between experimental and predicted ESI-MS/MS spectra for 1-palmitoyl-2-oleoyl-sn-glycero-3-phospho-L-serine (PS(16:0/18:1(9Z))) in the negative ([M−H]−) ion mode at a collision energy of 40 eV. The measured spectral similarity scores between the experimental and the in silico-generated spectra are 0.10, 0.92, and 0.91 with CFM-ID 2.0 (Figure 3a), CFM-ID 3.0 (Figure 3b), and LipidBlast (Figure 3c), respectively.


**Figure 3.** Head-to-tail plot of experimental and predicted ESI-MS/MS spectra of PS(16:0/18:1(9Z)). (**a**) Head-to-tail plot showing an experimental ESI-MS/MS spectrum of 1-palmitoyl-2-oleoyl-sn-glycero-3-phospho-L-serine (PS(16:0/18:1(9Z))) measured at 40 eV, and the matching ESI-MS/MS spectrum predicted by CFM-ID 2.0. The computed spectral similarity score is 0.10. (**b**) Head-to-tail plot showing an experimental ESI-MS/MS spectrum of 1-palmitoyl-2-oleoyl-sn-glycero-3-phospho-L-serine (PS(16:0/18:1(9Z))) measured at 40 eV, and the matching ESI-MS/MS spectrum predicted by CFM-ID 3.0. The computed similarity score is 0.92. (**c**) Head-to-tail plot showing an experimental ESI-MS/MS spectrum of 1-palmitoyl-2-oleoyl-sn-glycero-3-phospho-L-serine (PS(16:0/18:1(9Z))) measured at 40 eV, and the matching ESI-MS/MS spectrum predicted by LipidBlast. The computed similarity score is 0.91.

As highlighted in Table 3, CFM-ID 3.0 significantly outperforms CFM-ID 2.0 in terms of lipid spectral prediction performance (average score of 0.85 versus 0.09) and CFM-ID 3.0 generally outperforms LipidBlast (average score of 0.85 versus 0.34). Another important advantage of CFM-ID 3.0 over LipidBlast is the fact that it generates spectral predictions for multiple collision energies (10, 20, and 40 eV) whereas LipidBlast only provides a single spectrum at an undefined collision energy. Furthermore, all spectral predictions generated by CFM-ID 3.0 include information about not only the *m*/*z* values and their relative intensities but also the structure of the predicted fragments (expressed as InChI and SMILES strings) for every predicted peak. LipidBlast only provides the *m*/*z* values and intensities.

**Table 3.** Computed spectral similarity scores between experimental and predicted ESI-MS/MS spectra at three energy levels (10, 20, and 40 eV). The results show higher similarities, and thus an improvement when using a rule-based approach (CFM-ID 3.0) over a combinatorial one (CFM-ID 2.0) for the prediction of lipid ESI-MS/MS spectra. The spectral similarities of the LipidBlast-generated consensus spectra further illustrate this trend. When available, the same LipidBlast-generated consensus spectrum was used for comparisons at each energy level. N/A corresponds to cases where (1) the adduct type was not covered by CFM-ID 2.0 at all, or (2) the adduct type was not covered by LipidBlast for the chemical class to which the test compound belongs.


#### 2.3.2. Compound Identification Using the New Scoring Functions

A set of 1,000 compounds was used to train a new and improved scoring function for ESI-MS/MS-based compound identification (see Section 4.5). This function was developed in order to optimize CFM-ID 3.0's compound identification performance using spectral matching scores, compound classification information from high-scoring hits, and compound metadata (citations). The models were obtained using 5X cross-validation, and tested on different sets. Table 4 compares the performance of CFM-ID 3.0 versus CFM-ID 2.0 (2016 and 2019) and MS-FINDER [26] for compound identification based on 208 ESI-MS/MS spectra from 185 unique compounds. The test involving CFM-ID 2.0 for 2016 and 2019 used the same algorithm and scoring functions. However, the 2019 version mentioned here uses the expanded spectral library, compared to the in silico spectral library of 2016. The test spectra correspond to those provided for the CASMI 2016 contest (category 3). To establish a baseline and ensure the spectral similarity matching method worked, we first queried the 208 experimental spectra against the full spectral database (with those same 208 spectra included). In this case, both CFM-ID 2.0-2019 and CFM-ID 3.0 correctly identified all 208 query compounds with perfect spectral similarity scores. The 208 experimental spectra were then removed from the CFM-ID 3.0's spectral library and the queries were run again. As illustrated in Table 4, CFM-ID 3.0 was able to correctly identify the query compound in 149 out of 208 challenges, compared to only 123 by CFM-ID 2.0-2019 or 120 by CFM-ID 2.0-2016. This represents an improvement of 21.1%
over CFM-ID 2.0. The query compound was generally ranked higher (average rank = 1.8) by CFM-ID 3.0 compared to CFM-ID 2.0-2019 (average rank = 2.4). CFM-ID 3.0 also achieved a better medal score (848) compared to CFM-ID 2.0-2019 (718). A medal score is calculated as a weighted sum over challenges, with 5 points when the correct compound is ranked first (gold medal), 3 points when it is ranked second (silver), and 1 point when it is ranked third (bronze) [19]. CFM-ID 2.0-2016 and MS-FINDER [26] were also evaluated in the CASMI 2016 contest (category 3). As the original winner of the CASMI 2016 contest, MS-FINDER correctly identified the query compound in 146 cases [19]. However, as shown in Table 4, MS-FINDER scored 20% fewer top 3 hits compared to CFM-ID 3.0. Moreover, MS-FINDER achieved a lower medal score (766 versus 848), and a considerably worse average "hit" rank (6.4 versus 1.8) compared to CFM-ID 3.0. It is also worth noting that CFM-ID 2.0-2016 scored the lowest in terms of top 1 hits, had the worst average "hit" rank (13.6), and the lowest medal score (just 600).
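The sketch below shows this medal-score calculation under the interpretation given above (5/3/1 points for ranks 1/2/3); the example ranks are made up for illustration.

```python
# Minimal sketch of the medal score: 5 points for rank 1 (gold), 3 for rank 2
# (silver), 1 for rank 3 (bronze); other ranks contribute nothing.
def medal_score(ranks):
    points = {1: 5, 2: 3, 3: 1}
    return sum(points.get(rank, 0) for rank in ranks)

example_ranks = [1, 1, 2, 3, 7, 1, None]  # None: correct compound not retrieved
print(medal_score(r for r in example_ranks if r is not None))
```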

**Table 4.** Comparison of CFM-ID 3.0, CFM-ID 2.0, and MS-FINDER scoring functions upon identification of 185 compounds from 208 ESI-MS/MS spectra. Reported are the total number of challenges in which the corresponding implementation of the scoring function ranked the query compound in the top 1, top 3, and top 10. The average and median ranks for the query compound are also reported. A chemical classification is assessed as correct if the predicted category matches a category originally assigned by ClassyFire. N/A, not applicable; \* performance when applied over the expanded spectral library database including the 208 experimental ESI-MS/MS from the CASMI 2016 contest (category 3).


#### 2.3.3. Compound Chemical Classification

For this assessment, the 208 challenge MS/MS spectra (corresponding to 185 distinct compounds) from the 2016 CASMI competition were used as queries. CFM-ID 3.0 was used to predict the chemical class of the query compound with the predicted class being the direct parent (according to ClassyFire [20]) of the highest-ranked compound. The direct parent is the parental or broader chemical class in the ClassyFire hierarchy to which a compound belongs. It typically corresponds to the largest identifiable chemical skeleton or most dominant feature of the classified compound [20]. In case of a tie, the predicted class was identified as the most frequently occurring chemical class among the direct and alternative parents of all compounds with the highest score. Alternative parents are categories in the ClassyFire ontology that describe the classified compound and do not display a parent–child relationship to each other or to the direct parent [20]. When using ESI-MS/MS spectra as input, CFM-ID 3.0 correctly predicted the chemical class in 168 out of 208 challenges. Interestingly, in 19 out of the 168 challenges, the corresponding query compound could not be correctly identified. These results suggest that CFM-ID 3.0 was still able to capture key structural features that characterize the fragmentations observed in the corresponding input MS/MS spectra. These findings also demonstrate the importance of using a diverse set of compounds and MS/MS spectra to assist with compound identification or classification. In particular, they highlight the need for large compound/spectral databases for proper compound identification.
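A sketch of this class-prediction logic, under the assumption that each candidate carries a score together with its ClassyFire direct parent and alternative parents, is given below; the candidate values and class names are purely illustrative.

```python
# Sketch of class prediction: the class of the top-ranked candidate, with a
# frequency-based tie-break over direct and alternative parents of tied candidates.
from collections import Counter

def predict_class(candidates):
    top_score = max(c["score"] for c in candidates)
    top = [c for c in candidates if c["score"] == top_score]
    if len(top) == 1:
        return top[0]["direct_parent"]
    counts = Counter()
    for c in top:
        counts.update([c["direct_parent"], *c.get("alternative_parents", [])])
    return counts.most_common(1)[0][0]

candidates = [  # illustrative candidate list with equal top scores
    {"score": 0.91, "direct_parent": "Phosphatidylcholines",
     "alternative_parents": ["Glycerophospholipids"]},
    {"score": 0.91, "direct_parent": "Phosphatidylethanolamines",
     "alternative_parents": ["Glycerophospholipids"]},
]
print(predict_class(candidates))  # tie-break returns "Glycerophospholipids"
```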

#### **3. Discussion**

#### *3.1. ESI-MS*/*MS Lipid Spectra Prediction*

The much better performance for lipid spectra prediction via rule-based fragmentation approaches (CFM-ID 3.0) relative to combinatorial fragmentation approaches (CFM-ID 2.0) is likely due to two factors. First, lipids are modular molecules and so the MS fragmentation patterns seen under most collision energies are easily understood and relatively simple to describe. On the other hand, combinatorial fragmenters have no knowledge of molecular structure and so they cannot recognize modular structures. Instead, they view lipids as molecules with dozens of breakable bonds, all of which could potentially be fragmented. This leads to a substantial over-prediction of MS peaks. The second reason why combinatorial fragmenters do not perform well on lipids is that they have generally not been "trained" on lipid spectra. For example, CFM-ID 2.0 was only trained on ~1000 experimental MS/MS spectra, none of which included lipid MS/MS spectra. Similarly, MetFrag [9], another combinatorial fragmenter, was also not originally programmed to handle lipid MS/MS spectra (although a later version was [27]). By expanding CFM-ID's training set and including lipid spectra as well as other modular compound classes in that training set, CFM-ID could potentially improve its performance to match even the rule-based fragmenter.

Overall, our results show that CFM-ID 3.0 was able to reproduce most lipid fragments with accurate *m*/*z* ratios and reasonably accurate relative intensities. Characteristic fragment ion losses (e.g., loss of the polar head group, or side chains) were also well reproduced. CFM-ID 3.0's spectral predictions also include many ion fragments that are independent of the acyl or alkyl chain(s) of the molecular ion, including the 1,2-cyclic phosphate diester fragment (neutral *m*/*z* = 125.0003), which is often observed in ESI-MS/MS spectra of various choline glycerophospholipids. Interestingly, most of these kinds of fragments are not reported in LipidBlast-generated spectra. On average, the spectral similarity score between experimental ESI-MS/MS spectra and LipidBlast-generated spectra was 0.34 ± 0.4. One of the reasons for the lower similarity scores for LipidBlast is that it generates only one consensus spectrum per compound, which tends to match most closely with experimental ESI-MS/MS spectra collected at 40 eV. As a result, the average spectral similarities for LipidBlast-generated spectra to experimental spectra obtained at 10 and 20 eV are very low.

As expected, some discrepancies were observed when comparing predicted MS/MS spectra with the corresponding experimental MS/MS spectra. First, the relative peak intensities were generally found to be higher in the predicted MS/MS spectra than the experimental spectra. Second, the peak lists are often not identical. MS/MS spectral peak intensities are very difficult to predict and vary considerably depending on the instrument, the instrument parameters, and experimental design. For instance, phosphatidylcholines, when analyzed by Q-TOF instruments, tend to lose the molecular ion even at medium collision energies. On the other hand, when phosphatidylcholines were analyzed on ion-trap MS instruments, the molecular ion was still highly abundant at medium collision energies, and was significantly fragmented only at high energies [28–30]. In addition to instrument differences, the type of solvent being used can affect the extent to which a compound is fragmented. However, rather than focusing on these subtleties, we chose to focus on selecting (and annotating) the most abundant or most characteristic fragments, which were generally reproducible on different instruments, and reported in multiple studies.

While CFM-ID 2.0 predicts fragmentation probabilities and numeric peak intensities for all query compounds, CFM-ID 3.0 does not predict numeric peak intensities for lipid spectra (however, it still predicts numeric peak intensities for all other classes of molecules). Instead, CFM-ID 3.0 predicts categorical peak intensities for lipid spectra (low, medium, high, and maximum abundance). This simple categorization partly explains why, in many cases, the relative peak intensity is higher in predicted lipid spectra compared to experimental spectra. We believe that a larger lipid MS spectral training set (at least 10+ spectra per chemical class and adduct type) would help to improve the prediction of numeric intensities and simulate their variation between collision energies more accurately. Another limitation of CFM-ID 3.0's rule-based approach for lipid prediction is that the current fragmentation rules do not take information about the stereochemistry and the position of double/triple bonds into consideration. Therefore, the existing rules do not allow one to distinguish between stereoisomers or regioisomers. This is a common problem for rule-based "fragmenters", since the incorporation of such distinctions would require the acquisition of a much more diverse and larger set of high-resolution MSn spectra.

CFM-ID 3.0 returns the structure (as InChI or SMILES strings) for all predicted fragments. This helps to provide a rationale for nearly all observed peaks. Additionally, this linkage simplifies the lipid ESI-MS/MS spectral annotation process. Because CFM-ID 3.0 provides MS/MS spectra at three energy levels (10, 20, and 40 eV), the predicted MS/MS spectra can be matched more closely to real experimental conditions and real experimental MS/MS spectra. Many other spectral libraries (LipidBlast, NIST) only provide consensus MS/MS spectra for lipids, which makes it difficult to relate experimental data to the predictions.

#### *3.2. Compound Identification and Chemical Class Prediction*

The incorporation of citation counts in MS-based compound identification protocols has been consistently shown to improve identification rates in recent studies on spectral/compound identification [19,31]. However, an obvious limitation of this approach is that it reduces the probability of identifying novel or rare compounds that have never been cited. Citation counts can also bias the ranking scheme to select one very similar structure (and therefore a very similar MS spectrum) over another purely on the basis of one having slightly more citations than another. To help balance the influence of citation counts, we incorporated chemical classification into our new scoring system. In this way, the scientific relevance or approximate abundance (in terms of citations) as well as the structural features among candidates could be taken into consideration. Using this approach, a new scoring function was developed for compound identification in CFM-ID 3.0. This new function helped to improve MS-based compound identification quite significantly (see Table 4). When applied to 208 identification challenges on a CFM-ID spectral library containing the 208 ESI-MS/MS spectra, both CFM-ID 2.0 and CFM-ID 3.0 were able to identify all 208 compounds based on spectral similarity. However, as pointed out earlier, it can be expected that most spectral libraries, including CFM-ID 3.0's, will not include (the same) experimental spectra corresponding to a compound of interest. Thus, the use of metadata (i.e., citations) in addition to spectral similarity can help improve identification rates. In particular, when applied to 208 identification challenges, CFM-ID 3.0's ESI-MS/MS scoring function achieved an improvement in overall ranking and identification rate (21.1%) over CFM-ID 2.0's original scoring function. We believe the use of diverse training sets of compounds, representing widely varying structures and structural classes was critical to achieving this improved performance. Our work in this area also confirmed the notion that more work needs to go into expanding spectral databases with experimental data and that this will ultimately improve spectral/compound identification performance.

CFM-ID 3.0 was also assessed with regard to its performance in chemical class prediction. While it may not be possible to identify the exact compound via MS/MS spectral matching, the ability to use MS/MS spectra to narrow down the correct chemical class or chemical family for a given query spectrum or compound can be very valuable for many applications in metabolomics or natural product de-replication. In assessing the performance of CFM-ID's chemical class prediction, the same scoring system introduced here was used to rank the individual candidates. However, in order to perform a formal chemical class identification, the query compound was predicted to belong to the "direct parent" class of the highest-ranked candidate. In cases of a tie, the predicted chemical class was taken to be the most frequently occurring class among the direct and alternative parents of all the compounds with the highest score. Upon testing the new ESI-MS/MS scoring function on 208 challenges, the correct class was predicted in 80.8% of the challenges (compared to 71.6% for correct compound identification). This result indicates that even when a compound was not identified correctly, the correct class prediction could still be made (in 19 cases). This suggests that CFM-ID 3.0 is still able to identify structural features that characterize the MS/MS fragmentation patterns of certain classes of compounds. These results also demonstrate the importance of using a diverse set of compounds and spectra, as well as the need for a sufficiently large database to enable compound (or compound class) identification via spectral matching. Structurally similar compounds tend to produce similar spectra. Therefore, even if the compound (and its corresponding MS/MS spectrum) is not in the database (or is poorly ranked), a large number of compounds/spectra from related compound
classes could still help to identify the correct compound class. We believe that this helped CFM-ID 3.0 achieve its relatively good performance in the class prediction task.

The inclusion of additional data (citation frequency and chemical class information) in the CFM-ID scoring functions clearly improved compound identification performance. We also believe these improvements were partially aided by the much-improved quality of lipid MS/MS spectra predicted by CFM-ID 3.0. While we made substantive improvements to the quality of CFM-ID's lipid spectra prediction, more work still needs to be done in CFM-ID to better predict the MS/MS fragmentation of other classes of compounds (such as alkaloids, polyphenols, terpenes, and steroids) and to increase the quality of other predicted MS/MS spectra. The addition of many more experimental ESI-MS/MS spectra, measured with various MS instruments, and under different conditions, is also expected to help capture spectral patterns that are not yet accurately predicted by CFM-ID's algorithm, and thus, improve its overall compound identification performance.

#### **4. Materials and Methods**

To improve CFM-ID's overall performance for MS/MS analysis, we pursued several algorithmic and database enhancements such as (1) encoding and validating rules for ESI-induced fragmentation of 21 classes of lipids; (2) implementing an automated chemical classification schema (via ClassyFire) for both CFM-ID's database and its query compounds; (3) redesigning, expanding, and improving CFM-ID's MS/MS spectral library (by including experimental MS/MS spectra and adding many thousands of predicted ESI-MS/MS spectra, as well as metadata); (4) collecting citation information from various sources for all of the compounds in CFM-ID's MS/MS spectral library; and (5) modifying CFM-ID's scoring function to incorporate the above changes and improve its overall performance.

The encoding of the lipid rule-based fragmentation approaches was added to improve the speed and accuracy of CFM-ID's lipid ESI-MS/MS predictions, as well as to cover a larger pool of experimental conditions as reflected by different adduct types. The use of ClassyFire's chemical classification method [20] was implemented to automate the rule-based/combinatorial-based decisions for CFM-ID and to improve CFM-ID's ability to identify or re-rank potential MS/MS spectral matches based on structural similarity. The redesign and expansion of CFM-ID's spectral database was performed to accelerate search speeds, reduce the memory requirements, and to grow the spectral database size (of both predicted and known MS/MS spectra) by a factor of 2.6 so as to improve the likelihood of user query spectral matches. The inclusion of citation data was intended to enhance the scoring accuracy of potential MS/MS spectral matches, while the modification of CFM-ID's scoring function was intended to improve its overall performance. Details regarding how all of these changes were implemented are described below.

#### *4.1. Encoding Lipid Fragmentation Rules*

Our analysis of numerous databases and the literature indicated that there are 21 major classes of lipids for which MS/MS spectra are best predicted using hand-made fragmentation rules. The encoding of these hand-made lipid fragmentation rules involved several steps: (1) experimentally measuring or compiling (via literature) characteristic MS/MS fragment ions observed at each of three collision energy levels (10, 20, and 40 eV) for each lipid class, (2) determining the relative abundance of each fragment ion at each energy level, (3) accurately determining the chemical structure and *m*/*z* values of each of the fragment ions, (4) including more MS/MS experimental conditions (and adduct ions) by expanding the list of adduct types covered by previous versions of CFM-ID, and (5) implementing these rules using standardized cheminformatics languages (SMARTS [32] and SMIRKS [33]) in order to rapidly and accurately predict and annotate ESI-MS/MS spectra for lipids.

#### 4.1.1. Acquisition of Reference Lipid MS/MS Spectra

The generation of the lipid fragmentation rules required the acquisition of experimental ESI-MS/MS spectra for a number of lipids and lipid classes. The acquired spectra were collected at several collision energies, for various adduct types (e.g., [M+H]+, [M−H]−), and, if possible, from various MS instruments. This was used to help capture fluctuations or biases that can be introduced by the different parameters. A total of 533 experimental MS/MS spectra were collected for 16 standard lipids (purchased from Avanti Polar Lipids Alabaster, AL) from 15 lipid classes at various collision energies (10 to 60 eV), in both positive and negative mode using an AB Sciex QTrap 4000 MS instrument. For each lipid standard, an enhanced MS (EMS) scan was first collected to identify precursor ions with high abundance in either ionization mode. Enhanced product ion (EPI) scans were then collected for each precursor ion to generate the MS/MS spectra with different collision energy levels ranging from 10 to 60 eV, with the supervision of a mass spectrometry expert. For more information about the collection of spectra for the 16 standard lipids, see Supplementary Material. In addition to the MS/MS spectra collected in our laboratory, published lipid MS/MS spectral data were compiled from the LIPID MAPS [28] and the MoNA [24] databases. For the LIPID MAPS spectra, only annotated spectral images were available. Therefore, MS/MS peak lists were generated by annotating the peaks using a semi-automated approach. This approach consisted of computing the relative abundance of each peak, and manually mapping it to the *m*/*z* list provided in the LIPID MAPS spectrum. In addition to the experimental spectra, the LipidBlast and FAHFA 26 libraries, as well as MassBank [34], mzCloud [35], and the sphingolipid library of Tsugawa et al. [18] served as references that provided additional information for lipid classes not covered by our experiments. In total, 844 lipid MS/MS spectra from 21 lipid classes were collected and analyzed.

#### 4.1.2. Annotation of Reference Lipid MS/MS Spectra

With the lipid MS/MS spectra in hand, we proceeded to manually annotate each spectrum. This consisted of assigning each fragment ion peak to a specific structure and a specific reaction or fragmentation event (e.g., the loss of a water molecule from a [M+H]<sup>+</sup> precursor ion, the loss of a side chain, or the presence of a specific fragment). The annotation of spectra was limited to the in-house generated MS/MS spectra and the LIPID MAPS set, as both were measured with the same model of instrument (AB Sciex QTrap 4000). The annotation process was largely guided by the information provided in LIPID MAPS, LipidBlast, and other scientific reports [17,28,29,36]. In a number of cases, the same compound had MS/MS spectra in at least two of the data sets (including the LipidBlast database), and the corresponding spectra were available for the same adducts or ions. In these cases, we annotated the spectra by direct comparison of the peak lists. Among the 21 lipid classes, 11 were not covered by our in-house experimental data. For this reason, the MS/MS spectra of these missing lipid classes were extracted from the LIPID MAPS (experimental) and/or LipidBlast (in silico) library. Since the experimental and theoretical spectra acquired from other sources (LipidBlast, LIPID MAPS) did not always cover all three collision energy levels (10, 20, and 40 eV), the generation of consensus fragmentation patterns was done by comparing standards to corresponding experimental spectra obtained under the same conditions (adduct type and collision energy). Moreover, when applicable, experimental MS/MS spectra collected from other sources (e.g., MoNA) and obtained under similar experimental conditions were compared to one another, as well as to theoretical spectra. In particular, theoretical spectra from LipidBlast helped in the spectral annotation. The fragmentation and spectral annotation rules were further validated by mining the scientific literature and acquiring/confirming additional spectral data from published papers. Once the energy-specific fragmentation patterns were generated, the relative abundance of each peak was assigned to one of the four intensity levels: low (1–15%), medium (15–60%), high (60–90%), or maximum (90–100%) abundance level. The assigned intensity was based on observed relative abundances from our experimental spectra. The maximum level of abundance was assigned to the base peak, typically when no fragmentation was observed (usually at a low collision energy). Additional feedback from local MS experts combined with an extensive review of the lipid MS/MS literature helped to complete the spectral annotation process. This effort led to the near-complete annotation of all observed fragment ions, their precise *m*/*z* values, and
the corresponding fragmentation reactions for a total of 610 peaks from 21 lipid classes at each of 3 collision energies (10, 20, and 40 eV).
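As a concrete illustration of the categorical intensity assignment described above, the sketch below bins a relative abundance (expressed as a percentage of the base peak) into the four levels using the thresholds given in the text; the exact handling of boundary values is an assumption.

```python
# Minimal sketch of the categorical intensity assignment: relative abundances
# (percent of base peak) binned into low (1-15%), medium (15-60%),
# high (60-90%), or maximum (90-100%) levels.
def intensity_level(relative_abundance_pct):
    if relative_abundance_pct >= 90:
        return "maximum"
    if relative_abundance_pct >= 60:
        return "high"
    if relative_abundance_pct >= 15:
        return "medium"
    return "low"

for pct in (100, 72, 30, 5):
    print(pct, intensity_level(pct))
```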

#### 4.1.3. Implementation of Lipid Fragmentation Rules

The annotated fragment ions along with their structures and reactions provided the basis for the creation of fragmentation rules. All of the fragmentation rules were implemented in the Java programming language through a new "lipid fragmenter module" in CFM-ID. The structural backbone of each lipid or lipid fragment class was represented using the Daylight SMARTS language [32], via a module implemented in ClassyFire (version 2.1), a software tool for automating structure-based hierarchical annotation of chemicals [20]. To accelerate the lipid classification process, a sub-ontology from the ChemOnt [20] ontology was used. For each lipid or lipid fragment class, one set of fragmentation patterns is encoded for each of the applicable adducts as chemical reactions. The chemical reactions are represented using the Daylight SMIRKS language [33]. Moreover, SMARTS strings are used to select the appropriate fragments [32]. Additionally, a number of transformation rules were encoded to standardize the structures of all the query compounds. The standardization of the fragmentation reactions using well-developed cheminformatics languages ensures that the structural representations are consistent for all query compounds, structural classes, and chemical reactions. Without adhering to these standards, many chemical classes could be misidentified or invalid fragments could be returned.

The new CFM-ID lipid fragmenter program has been fully integrated into the existing spectral prediction workflow of the previous version of CFM-ID [12]. In CFM-ID 3.0, the lipid MS/MS prediction tasks require a lipid structure (submitted as a SMILES string or SDF file) and an adduct or an ion as input. Upon submission, the compound is classified based on its structure via ClassyFire. If the compound is identified by ClassyFire as a lipid molecule belonging to any of the covered classes and if the fragmentation patterns applicable to the selected adduct exist in the lipid fragmentation library, then the compound is fragmented accordingly. The fragmentation operation is executed using the AMBIT library [37]. After the in silico fragmentation step is completed, the relative abundance of each peak is assigned (using the fragmentation rules described above), and three ESI-MS/MS spectra are generated (at 10, 20, and 40 eV). Relative intensities are assigned using a set of pre-calculated intensities based on the chemical class, the adduct type, and the collision energy. The fragmentation patterns, as well as the relative intensities of the resulting peaks, are the same for all compounds from the same chemical class, under the same experimental conditions (i.e., adduct type and collision energy). If no set of fragmentation patterns is applicable to the compound and/or the selected adduct, and the compound is not a glycero-, phospho- or sphingolipid, then the ESI-MS/MS spectra are predicted using the original CFM-ID algorithm as implemented in CFM-ID 2.0. However, if the compound is a glycero-, phospho- or sphingolipid, the computation is stopped, and an error message is returned. This is done to ensure that CFM-ID does not use the combinatorial fragmentation algorithm, which, as mentioned, does not perform well for such compounds. The resulting ESI-MS/MS spectra are then returned with each peak annotated by its *m*/*z* value, its relative abundance, and the chemical structure of the corresponding fragment encoded in a standard SMILES format. Additionally, any available experimental MS spectra in the CFM-ID spectral database matching the query compound are also displayed in the results alongside the predicted spectra.
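The dispatch logic described in this paragraph can be summarized schematically as in the sketch below. This is not the actual CFM-ID implementation: the covered (class, adduct) pairs, class names, and return values are placeholders used only to illustrate the control flow (rule-based prediction when covered, combinatorial fallback otherwise, and an error for uncovered glycero-, phospho- or sphingolipids).

```python
# Schematic sketch of the prediction dispatch (placeholders only, not CFM-ID code).
COVERED_RULES = {("Phosphatidylcholines", "[M+H]+"),
                 ("Triacylglycerols", "[M+Na]+")}          # illustrative coverage
LIPID_SUPERCLASSES = {"Glycerolipids", "Glycerophospholipids", "Sphingolipids"}

def predict_spectrum(compound_class, superclass, adduct):
    if (compound_class, adduct) in COVERED_RULES:
        return f"rule-based spectra at 10/20/40 eV for {compound_class} {adduct}"
    if superclass in LIPID_SUPERCLASSES:
        # Uncovered lipid: do not fall back to combinatorial fragmentation.
        raise ValueError("No fragmentation rule for this lipid/adduct combination.")
    return f"combinatorial (CFM) spectra at 10/20/40 eV for {adduct}"

print(predict_spectrum("Phosphatidylcholines", "Glycerophospholipids", "[M+H]+"))
print(predict_spectrum("Flavonoids", "Phenylpropanoids and polyketides", "[M+H]+"))
```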

#### *4.2. Integration of Chemical Classification*

Similar structures tend to undergo similar MS fragmentation events under the same conditions. For this reason, a number of in silico MS fragmentation algorithms now take the chemical structure of query molecules into consideration for improved MS-spectra prediction and compound identification tasks. For the prediction of EI-MS/MS spectra, CFM's scoring function partly relies on a list that describes the presence or absence of 107 functional groups and 86 fragment descriptors. These groups and fragment descriptors are provided by ClassyFire [20] and RDKit [38], respectively. Other computational tools such as CSI:FingerID [7] rely on models that can predict the presence of functional groups and fragments based on a given compound or a given MS-spectrum. For this reason, it might be expected that for compound identification tasks, the highest-ranked candidates would likely share a significant number of functional groups or possibly share a maximum common substructure. This information would be particularly helpful in cases where it is very difficult to discriminate between the highest-ranked candidates. More specifically, the presence of one or more common structural backbones (e.g., diterpene, ceramide, phosphatidylglycerol, etc.) could significantly impact the ranking, when very structurally similar candidates are prioritized among those that have a high spectral similarity to the query compound.

Therefore, a chemical classification was assigned to each compound in CFM-ID 3.0's database. The chemical classification was computed by ClassyFire and retrieved using the ClassyFire API [20]. As will be described later in this section, the chemical class assigned to candidate molecules was taken into account, along with other metadata, to improve the original CFM scoring method (dot product or Jaccard score). In addition to the adjustment of the scoring function, chemical classification was also used to predict the chemical class(es) to which the query compound belonged. Formally, the predicted chemical class corresponds to the direct parent of the highest-ranked candidate. In the case of a tie, the predicted chemical class is the most frequently occurring direct or alternative parent among all of the top-scoring candidates.
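A minimal sketch of this class-prediction rule, including the tie-breaking step, is shown below; the candidate data structure is illustrative and is not CFM-ID's internal representation.

```python
# Sketch of the chemical-class prediction described above: use the direct parent of the
# top-ranked candidate, breaking ties by the most frequent direct/alternative parent
# among all top-scoring candidates. Data structures are illustrative only.
from collections import Counter

def predict_class(candidates):
    """candidates: list of dicts with 'score', 'direct_parent', 'alternative_parents'."""
    top_score = max(c["score"] for c in candidates)
    top = [c for c in candidates if c["score"] == top_score]
    if len(top) == 1:
        return top[0]["direct_parent"]
    # Tie: count direct and alternative parents across all top-scoring candidates.
    parents = Counter()
    for c in top:
        parents.update([c["direct_parent"], *c["alternative_parents"]])
    return parents.most_common(1)[0][0]

print(predict_class([
    {"score": 0.9, "direct_parent": "Ceramides", "alternative_parents": ["Sphingolipids"]},
    {"score": 0.9, "direct_parent": "Glycosphingolipids", "alternative_parents": ["Sphingolipids", "Ceramides"]},
]))
```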

#### *4.3. Collection of Compound Citations*

Several studies have demonstrated that the integration of metadata can significantly improve compound identification rates with spectral library searches [7,9,19]. In particular, the frequency with which a compound is mentioned in the literature could serve as a proxy for the likelihood that the compound is sufficiently abundant for detection via MS/MS methods. Therefore, every compound in the CFM-ID spectral library was assigned a citation score. An initial set of citation counts was obtained using DataWrangler, an in-house tool that automatically mines PubChem [39], HMDB [15], ChemSpider [40], and ChEBI [41], and returns a unique list of scientific reference citations for a given compound. A second set containing PubMed citation counts (without PubMed IDs) was obtained by mining the EPA's CompTox dashboard [42]. This set was computed and generously provided to us by the CompTox dashboard's development team. The two sets were merged by comparing each compound's InChIKeys. More specifically, when a compound had a citation count in only one set, the corresponding citation count was assigned to that compound. For compounds that had citation counts from both DataWrangler and CompTox, the larger count was assigned, as it was expected that both counts could include many of the same citations. A total of 140,379 compounds were assigned a citation count of 1 or more. For the remaining compounds, DataWrangler assigned a custom citation count of 1 if, and only if, they were found in at least one of the following databases: HMDB [15], DrugBank [21], T3DB [43], ContaminantDB [44], FooDB [45], ECMDB [46], YMDB [47], and PhytoHub [22].
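The merging logic can be sketched as follows, assuming two dictionaries of citation counts keyed by InChIKey and a set of compounds found in the curated databases; the variable names are illustrative, not those of DataWrangler.

```python
# Sketch of the citation-count merging described above: take the larger count when both
# sources report one, and assign a default count of 1 to compounds found only in curated
# databases (HMDB, DrugBank, T3DB, etc.). Names are illustrative.
def merge_citation_counts(datawrangler, comptox, curated_db_members):
    """datawrangler, comptox: dicts mapping InChIKey -> citation count.
    curated_db_members: set of InChIKeys present in at least one curated database."""
    merged = {}
    for key in set(datawrangler) | set(comptox):
        counts = [d[key] for d in (datawrangler, comptox) if key in d]
        merged[key] = max(counts)  # the two sources may share many of the same citations
    for key in curated_db_members:
        merged.setdefault(key, 1)  # custom count of 1 for database-only compounds
    return merged

print(merge_citation_counts({"AAA": 12}, {"AAA": 30, "BBB": 2}, {"CCC"}))
# expected content: {'AAA': 30, 'BBB': 2, 'CCC': 1}
```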

#### *4.4. Redesigning and Expanding of CFM-ID's Spectral Library*

The original reference spectral library in CFM-ID 2.0 contained unique computationally generated ESI-MS/MS spectra for ~51,000 compounds from the HMDB and KEGG databases. These in silico ESI-MS/MS spectra were computed in positive ([M+H]+) and negative ([M−H]−) ionization modes, one for each of three collision energies (10, 20, and 40 eV). In order to significantly improve identification rates, the new CFM-ID library was updated as described below.

#### 4.4.1. Collection of Experimental MS/MS Spectra from External Sources

While the accuracy of computationally predicted MS spectra is often quite good, the accuracy of experimentally collected MS spectra is obviously much better. Therefore, the inclusion of experimentally determined ESI-MS/MS spectra would be expected to improve the match scores for query spectra/compounds that have previously been analyzed by ESI-MS/MS. Experimentally determined ESI-MS/MS spectra were downloaded from the MassBank of North America's (MoNA) online repository [24]. As of February 2019, MoNA contained 89,861 experimental LC-MS/MS spectra for 12,799 compounds. The spectra and compounds in MoNA originate from several databases, including the HMDB database [15], MassBank [18], the GNPS database [23], and the ReSpect database [48], among others. Only experimental spectra were collected from MoNA. An additional 915 ESI-MS/MS spectra were manually regenerated for 523 compounds from information contained in the NIST 14 database.

Since CFM-ID uses models trained on MS/MS spectral data sets that use specific collision energy and mass accuracy criteria, the HMDB, MoNA, and NIST spectra were further filtered to match these criteria. Specifically, experimental MS spectra were required to have a known ionization type, a known compound neutral mass, and to have been analyzed with high-resolution MS instruments (e.g., Q-TOF instruments) in the case of LC-MS spectra. After filtering, there were 72,678 usable experimental MS/MS spectra remaining from the MoNA dataset. These experimental MS/MS spectra were converted into the peak list format required for CFM-ID and uploaded into CFM-ID's online spectral library. In addition, the complete library of experimental MS/MS spectra from the HMDB was obtained from our in-house repository, and filtered. Upon filtering, this library contained 1152 unique ESI-MS/MS spectra for 239 unique compounds.
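The filtering criteria can be expressed as a simple predicate over each spectrum's metadata, as in the sketch below; the record fields are illustrative and do not correspond to the actual MoNA or HMDB schemas.

```python
# Sketch of the spectrum-filtering criteria described above (illustrative field names).
HIGH_RES_INSTRUMENTS = ("q-tof", "qtof", "orbitrap", "ft-icr")

def keep_spectrum(record: dict) -> bool:
    """Keep only spectra with a known ionization mode, a known compound neutral mass,
    and (for LC-MS/MS) a high-resolution instrument type."""
    if record.get("ionization_mode") not in ("positive", "negative"):
        return False
    if record.get("neutral_mass") is None:
        return False
    instrument = (record.get("instrument_type") or "").lower()
    return any(tag in instrument for tag in HIGH_RES_INSTRUMENTS)

print(keep_spectrum({"ionization_mode": "positive",
                     "neutral_mass": 760.5851,
                     "instrument_type": "LC-ESI-QTOF"}))  # True
```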

#### 4.4.2. Compilation of Predicted ESI-MS/MS Spectra

The original CFM-ID 2.0 database contained 102,153 unique computationally generated ESI-MS/MS spectra (from 51,635 compounds). Among the 102,153 ESI-MS/MS spectra, 36,746 were previously computed for 18,373 unique compounds belonging to the 21 lipid classes covered by the rule-based fragmenter, and these were transferred to the CFM-ID 3.0 database. The remaining 65,407 mass spectra computed by CFM-ID 2.0 were also moved to the CFM-ID 3.0 database. In total, ~36,900 spectra were generated for 18,438 lipids. In addition, ~207,956 ESI-MS/MS spectra were computed for ~80,000 further lipids and 7288 other metabolites obtained from new versions of HMDB, DrugBank, and PhytoHub, and these compounds were added to the CFM-ID 3.0 database. These predicted ESI-MS/MS spectra were generated for both positive and negative ion mode, as well as at three different collision energies (10, 20, and 40 eV). In total, the CFM-ID 3.0 database now contains 310,109 computationally generated ESI-MS/MS spectra (from 155,544 compounds). If the experimental ESI-MS/MS spectra are added to this total, the CFM-ID 3.0 spectral database now contains a grand total of 393,158 ESI-MS/MS spectra.

#### *4.5. Modifying CFM-ID's Scoring Function and Ranking Schema*

The results of the Critical Assessment of Small Molecule Identification (CASMI) 2016 contest showed that the integration of additional data (i.e., citation frequency of compounds and structure similarity) into the original scoring function for CFM-ID improved compound identification rates [19]. This trend was also observed for several other tools during the contest and in separate studies [9,19]. To create a combined score, the original dot product spectral similarity score computed by CFM-ID was combined with a citation score and a chemical classification score. The citation score is based on the number of citations that a given compound has in the scientific literature. More highly cited compounds are typically those that are more commonly detected, studied, or used. Therefore, the citation score serves as a proxy for the general abundance or concentration of a compound and is intended to favor more abundant compounds over extremely rare compounds or those present at very low biological concentrations.

The chemical classification score is based on the number of chemical categories to which a compound is assigned (by ClassyFire), relative to the total pool of chemical classes assigned to all candidate molecules. The chemical classification score was added to help re-rank or cluster structurally similar molecules (and MS spectra) closer together. Each of the three scores was normalized by dividing its computed score by the maximum score across the candidate list. The general formula for the total candidate score is:

$$S_{\mathrm{TOTAL}}(C) = a_{\mathrm{CFM\_ORIG}} \cdot S_{\mathrm{CFM\_ORIG}}(C) + a_{\mathrm{CLASS}} \cdot S_{\mathrm{CLASS}}(C) + a_{\mathrm{REF}} \cdot S_{\mathrm{REF}}(C)$$

where $S_{\mathrm{TOTAL}}(C)$, $S_{\mathrm{CFM\_ORIG}}(C)$, $S_{\mathrm{CLASS}}(C)$, and $S_{\mathrm{REF}}(C)$ are the total score, the normalized spectral matching CFM-ID score, the normalized ClassyFire score, and the normalized reference score for candidate $C$, respectively. Each of the three scores is weighted by the coefficients $a_{\mathrm{CFM\_ORIG}}$, $a_{\mathrm{CLASS}}$, and $a_{\mathrm{REF}}$, respectively, where:

$$a_{\mathrm{CFM\_ORIG}},\ a_{\mathrm{CLASS}},\ a_{\mathrm{REF}} \geq 0$$

and

$$a_{\mathrm{CFM\_ORIG}} + a_{\mathrm{CLASS}} + a_{\mathrm{REF}} = 1$$
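To make the formula concrete, the sketch below shows how the three scores could be normalized by their maxima over a candidate list and combined with non-negative weights summing to one. The `Candidate` fields and function names are illustrative only, not CFM-ID's internal API; the default weights anticipate the final values reported below.

```python
# Minimal sketch of the combined candidate score S_TOTAL described above (illustrative names).
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    cfm_score: float    # raw spectral similarity (dot product or Jaccard)
    class_score: float  # raw ClassyFire-based classification score
    ref_score: float    # raw citation-based reference score

def total_scores(candidates, a_cfm=0.6, a_class=0.1, a_ref=0.3):
    """Normalize each score by its maximum over the candidate list, then combine
    with weights that are non-negative and sum to 1."""
    assert abs(a_cfm + a_class + a_ref - 1.0) < 1e-9
    max_cfm = max(c.cfm_score for c in candidates) or 1.0
    max_cls = max(c.class_score for c in candidates) or 1.0
    max_ref = max(c.ref_score for c in candidates) or 1.0
    return {
        c.name: a_cfm * c.cfm_score / max_cfm
              + a_class * c.class_score / max_cls
              + a_ref * c.ref_score / max_ref
        for c in candidates
    }

cands = [Candidate("A", 0.92, 3, 120), Candidate("B", 0.95, 1, 4)]
print(sorted(total_scores(cands).items(), key=lambda kv: -kv[1]))
```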

This scoring function was then iteratively optimized on a training data set to maximize its metabolite identification potential. In particular, the optimal set of coefficients was determined through a grid search using a manually selected set of 1000 spectral/compound identification tasks (for 1000 unique compounds ranging from drugs to lipids). Each of the selected molecules had one or more experimental spectra at one of the three energy levels (10, 20, and 40 eV), in addition to the predicted ESI-MS/MS spectra. The data set was divided into five equally sized subsets. Several models (each with a unique combination of coefficients) were trained on 800 compounds (4/5 of the data set) and tested on the remaining 200 (1/5 of the data set). This process was repeated four more times, using a different test set of 200 compounds for each iteration. Experimental spectra were used as input for each identification test, and upon testing, only the best model was selected. A consensus model was built based on the five selected models and further tested using a smaller test set. The final coefficient values for the ESI-MS/MS scoring function were $a_{\mathrm{CFM\_ORIG}} = 0.6$, $a_{\mathrm{CLASS}} = 0.1$, and $a_{\mathrm{REF}} = 0.3$.
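A simplified sketch of such a grid search with five-fold splitting is shown below. It is not the actual training code: it only enumerates weight triples that sum to one and averages a user-supplied evaluation function over held-out folds, omitting the per-fold model selection and consensus steps described above.

```python
# Minimal sketch (not CFM-ID's training code) of a grid search over weight triples
# (a_cfm, a_class, a_ref) summing to 1, evaluated over five held-out folds.
from itertools import product

def weight_grid(step=0.1):
    """All non-negative (a_cfm, a_class, a_ref) triples on a grid that sum to 1."""
    n = round(1 / step)
    return [(i * step, j * step, (n - i - j) * step)
            for i, j in product(range(n + 1), repeat=2) if i + j <= n]

def five_fold_indices(n_tasks, k=5):
    """Split task indices 0..n_tasks-1 into k equally sized contiguous folds."""
    fold = n_tasks // k
    return [list(range(i * fold, (i + 1) * fold)) for i in range(k)]

def grid_search(n_tasks, evaluate, step=0.1, k=5):
    """evaluate(weights, test_indices) -> identification accuracy on the held-out fold."""
    folds = five_fold_indices(n_tasks, k)
    best = None
    for weights in weight_grid(step):
        acc = sum(evaluate(weights, test) for test in folds) / k
        if best is None or acc > best[1]:
            best = (weights, acc)
    return best

# Toy usage: a fake evaluator that (arbitrarily) prefers a_cfm close to 0.6.
print(grid_search(1000, lambda w, test: 1 - abs(w[0] - 0.6)))
```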

This function was further refined to improve its performance and to deal with certain extreme or rare situations. In particular, we observed certain cases in which the spectral similarity between a query and a database match is so close that applying the citation and chemical classification scores causes such strong matches to be unfairly penalized. In order to prevent this, we implemented a 95% similarity threshold, above which only the original spectral similarity score is applied and the citation and classification information is disregarded. Moreover, our training showed that the enormous discrepancy between citation counts (from 2 to >30,000) could negatively impact the identification rate. For instance, a compound that would be correctly identified using spectral similarity (with or without metadata) could easily be ranked much lower in favor of a similar compound with a significantly higher citation score. Thus, the citation count was capped at 156 (twice the average citation count) for every compound with more than 156 citations.
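The two refinements can be expressed in a few lines, as in the sketch below; the function names are illustrative, and the inputs are assumed to be already normalized as in the earlier scoring sketch.

```python
# Sketch of the two refinements described above (illustrative names):
# (1) cap citation counts at 156 (twice the average) before computing the reference score;
# (2) bypass the metadata scores when spectral similarity exceeds the 95% threshold.
CITATION_CEILING = 156     # 2 x average citation count in the library
SIMILARITY_BYPASS = 0.95   # above this, use the spectral score alone

def capped_citation(count: int) -> int:
    """Limit very large citation counts so they cannot dominate the ranking."""
    return min(count, CITATION_CEILING)

def refined_score(spectral, class_score, ref_score,
                  a_cfm=0.6, a_class=0.1, a_ref=0.3):
    """Combined score with the high-similarity bypass; inputs normalized to [0, 1]."""
    if spectral >= SIMILARITY_BYPASS:
        return spectral  # near-perfect spectral match: ignore metadata
    return a_cfm * spectral + a_class * class_score + a_ref * ref_score

print(capped_citation(30500), refined_score(0.97, 0.2, 0.1), refined_score(0.80, 0.2, 0.1))
```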

#### *4.6. Performance Testing*

Three types of performance tests were conducted. The first assessed the performance of the lipid ESI-MS/MS spectral prediction method, the second assessed the performance of the new scoring function in exact compound identification, and the third assessed CFM-ID's performance in identifying a compound's correct chemical class. To test the lipid ESI-MS/MS spectral prediction method, a benchmark analysis was performed on 20 randomly chosen lipids from each of the 21 known lipid classes for which fragmentation rules were derived. The computation was performed on a 2.7 GHz Intel Core i5 running macOS with 16 GB (1867 MHz DDR3) of memory. A total of 120 ESI-MS/MS spectral predictions were generated for both CFM-ID 2.0 and CFM-ID 3.0 at 3 different energies and 2 different ionization modes with various adduct types. The average execution time was determined for each spectral prediction. In addition to the execution time comparison, a further performance comparison was conducted to assess the quality of the predicted MS/MS spectra. For this task, a set of 5 experimental ESI-MS/MS spectra measured in positive ion mode and 9 experimental ESI-MS/MS spectra measured in negative ion mode was selected. The selected spectra were measured under conditions that can be simulated by CFM-ID 3.0's lipid fragmentation rules (same energy levels, same adducts). For each experimental MS/MS spectrum, CFM-ID 2.0 and CFM-ID 3.0 were used to predict a corresponding MS/MS spectrum under the same conditions. The performance was assessed by measuring the average pairwise spectral similarity between experimental and predicted spectra using a standard dot product score, as implemented in the OrgMassSpecR package [49]. Moreover, the predicted spectra were also compared to those in LipidBlast, as the selected lipids and their corresponding predicted spectra are also contained in the LipidBlast library.
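For reference, a standard dot-product (cosine) spectral similarity can be computed as in the Python sketch below, which aligns peaks by binning their *m*/*z* values to a fixed width. This is a generic illustration, not a reimplementation of the OrgMassSpecR score used in the study.

```python
# Generic dot-product (cosine) similarity between two peak lists, aligning peaks
# by rounding m/z to a fixed bin width. Illustrative only, not OrgMassSpecR.
import math

def dot_product_similarity(spec_a, spec_b, bin_width=0.01):
    """spec_a, spec_b: lists of (mz, intensity) tuples. Returns a cosine in [0, 1]."""
    def binned(spec):
        vec = {}
        for mz, intensity in spec:
            key = round(mz / bin_width)
            vec[key] = vec.get(key, 0.0) + intensity
        return vec
    a, b = binned(spec_a), binned(spec_b)
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

experimental = [(184.0733, 100.0), (104.1070, 35.0), (86.0964, 20.0)]
predicted    = [(184.0735, 90.0),  (104.1068, 40.0)]
print(round(dot_product_similarity(experimental, predicted), 3))
```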

The ESI-MS/MS scoring function was tested on a set of 208 experimental ESI-MS/MS spectra (for 185 unique compounds) generated on a Q Exactive Plus Orbitrap (Thermo Scientific) and used for the CASMI 2016 contest (Category 3) [19]. These spectra were used as input for compound identification. One hundred and twelve of the 185 compounds were included in the database and had at least one experimental ESI-MS/MS spectrum in addition to the pre-computed ones. For each of the remaining compounds, ESI-MS/MS spectra were predicted using CFM-ID and stored in the spectral library. To assess the performance, we used the CFM-ID 2.0 and CFM-ID 3.0 scoring functions, separately, to attempt to identify the query compounds. The evaluation was performed in two steps. The first consisted of identifying compounds by searching them against the CFM-ID spectral library that included the 208 experimental ESI-MS/MS spectra from CASMI. In the second step, the search was performed after excluding the 208 experimentally acquired spectra from the searchable portion of the database. The required mapping between Orbitrap collision energies and Q-TOF collision energies (which are used by CFM-ID) is described in the Supplementary Material.

For the third assessment, CFM-ID 3.0 was evaluated on its performance in chemical class prediction/identification. This assessment was included because, in many situations involving MS-based metabolomics or MS-based natural product identification, it may not be possible to identify the exact compound via MS/MS spectral matching. Therefore, the ability to use MS/MS spectra to reduce the candidate list and to predict the correct chemical class or chemical family for a given query spectrum or compound can be very valuable. In assessing the performance of CFM-ID's chemical class prediction, the query compound was predicted to belong to the "direct parent" class of the highest-ranked candidate. In cases of a tie, the predicted chemical class was the most frequently occurring direct or alternative parent among all of the top-scoring candidates.

#### **5. Conclusions**

We have shown that it is possible to substantially improve CFM-ID's performance in both spectral prediction and compound identification tasks. This was achieved in a number of ways, including (1) integrating a rule-based fragmentation approach that currently applies 344 manually curated rules to predict the ESI-MS/MS spectra for 21 classes of common, biologically important lipids, (2) modifying the structure of CFM-ID's spectral database and increasing its size by a factor of 2.6, (3) designing new scoring functions that take into account both compound citation frequency and chemical classification features of candidate molecules, and (4) implementing a chemical classification algorithm based on spectral similarity.

In particular, the implementation of a rule-based approach for fragment ion prediction was shown to improve the speed of lipid ESI-MS/MS spectral prediction by a factor of 200 and its accuracy by a factor of 10. The success of using rule-based fragmentation patterns encoded in standard chemical representations (SMILES, SMARTS, and SMIRKS) suggests that this concept could be successfully applied to other classes of modular molecules such as carnitines, polyphenols, terpenes, and carbohydrates. The construction and expansion of CFM-ID's spectral library has also helped CFM-ID's overall performance. The most recent spectral library has been expanded by a factor of 2.6 over the previously released spectral library. This expansion process is still ongoing, and we plan to include ~500,000 more compounds, including drugs, lipids, environmental pollutants, phytochemicals, and food compounds, as well as their predicted metabolites generated by BioTransformer [50]. The new scoring function, which already showed an improvement over CFM-ID 2.0's scoring function, could potentially be further improved by using machine learning techniques and training over a much larger set of MS/MS spectra. Moreover, the acquisition and incorporation of other metadata, such as retention time or collisional cross section information, could help further increase compound identification rates, as demonstrated in several recent studies [9,19].

The fields of in silico metabolomics and in silico mass spectrometry are rapidly evolving. Thanks to the many excellent ideas emerging in many labs around the world and the willingness of many researchers to share their code and their databases, it is likely that these fields will continue to grow and continue to inspire others to make MS spectral analysis, MS spectral prediction, and MS-based compound identification better, faster, and even more informative.

**Supplementary Materials:** The following are available online at http://www.mdpi.com/2218-1989/9/4/72/s1.

**Author Contributions:** Conceptualization, Y.D.-F., and D.S.W.; methodology, Y.D.-F., D.S.W., and J.Z.; standard acquisition, J.Z. and Y.D.-F.; MS data interpretation, Y.D.-F., N.K., and J.Z.; generation of fragmentation rules, Y.D.-F.; validation of fragmentation rules, Y.D.-F., N.K., and J.Z.; spectral library expansion, Y.D.-F., A.P., C.L., D.A., and M.G.; scoring function learning and evaluation, Y.D.-F. and A.P.; software, Y.D.-F., A.P., and C.L.; writing—original draft preparation, Y.D.-F., J.Z., A.P., and D.S.W.; writing—review and editing, Y.D.-F., J.Z., A.P., C.L., N.K., D.A., F.A., M.G., and D.S.W.

**Funding:** This research was funded by Genome Canada and Genome Alberta.

**Acknowledgments:** The authors would like to acknowledge Tanvir Sajed (University of Alberta, Edmonton, AB, Canada) for providing spectral data from the Human Metabolome Database, Fei Wang (University of Alberta) for helping with the semi-automated annotation of peak list from the LIPID MAPS set of standards, and Antony J Williams (U.S. Environmental Protection Agency) for providing citation metadata from the chemical dashboard.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

