1. Introduction
Metabolic fingerprinting, also called untargeted metabolomics, aims at analyzing all detectable metabolites in a given sample. It is a semiquantitative approach that focuses on spotting significant differences in metabolite abundance among different groups of a sample set [1,2,3,4,5]. To this end, liquid chromatography–mass spectrometry (LC–MS) is widely used.
Issues such as matrix effects (i.e., the variability of ionization efficiency); analytical variability, e.g., a decline in instrument performance when analyzing large sample batches; and difficulties in annotating the detected features are hurdles of this approach [6,7,8]. Metabolic profiling, on the other hand, circumvents these obstacles by using known amounts of labeled internal standards (IS) together with tailored sample preparation and LC–MS methods for the optimal detection and absolute quantification of a predefined but often relatively small set of target metabolites [9,10]. Often, however, relative quantification, as in metabolic fingerprinting, is sufficient and absolute quantification is not required. In this context, stable isotope-labeled internal standards can improve analytical repeatability and allow for the relative quantification of a large number of metabolites in a single LC–MS run. The question remains as to which internal standards are best used when all detectable metabolites are targeted. One approach is to use a stable isotope-labeled reagent to derivatize a sample aliquot or a pool sample to generate a sample-specific IS [11,12,13,14]. Yet, this approach is selective for the compound classes that are covered by the chosen derivatization. Global labeling of the whole metabolome of an organism has been used in targeted [15] and untargeted [16] analyses, but it is limited to the respective organism. Dethloff et al. [17] used partially labeled mouse plasma as an IS for amino acid analysis in human plasma. Furthermore, 13C-labeled yeast extracts have been used (ISOtopic solutions, Vienna, Austria) [9]. Employing a complex 13C-labeled yeast extract is also the basis of the IROA TruQuant kit (IROA Technologies LLC®, Bolton, MA, USA).
Isotopic ratio outlier analysis (IROA) is a mass-spectrometry-based technique that helps to distinguish features of biological origin from nonbiological artefacts and, secondly, aims at providing more reliable quantitative data by correcting for ion suppression and analytical variance [
18]. The kit includes two types of standards. The long-term reference sample (LTRS) is a fully labeled
S. cerevisiae yeast cell extract, labeled at 5% and 95%
13C and mixed 1:1. This translates to a unique isotopic pattern for all features of a biological origin, whereas artefacts would exhibit the isotopic pattern of natural abundance, allowing for a distinct filter criterion. Note that the LTRS is injected
as is, without mixing with any specimens of interest. Thereby, the LTRS serves as a quality control (QC) for instrumental performance. It also allows for a validated annotation of the detected features that show the expected isotopic pattern, yielding a reference library. The second standard is the internal standard (IS), which only contains the 95%
13C labeled yeast extract. It is added to the samples to be analyzed, providing the advantages of a traditional internal standard by correcting for ion suppression and analytical variance. It also offers a more robust annotation process, benefiting from the reference library built using the LTRS measurements [
6,
18]. The key advantage compared to a traditional internal standard is that it comprises a very large number of labeled compounds, i.e., all metabolites present in yeast. However, as the exact quantities of the labeled compounds are not known, it allows for relative but not for absolute quantification.
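To illustrate why labeling at 5% and 95% 13C yields a recognizable pattern, the following minimal Python sketch computes the distribution of 13C atoms over the carbon positions of a molecule. It assumes a simple binomial model in which every carbon is labeled independently with the stated probability, and it ignores all other elements. The function name and the six-carbon example are hypothetical and are not part of the kit's software.

```python
from math import comb

def carbon_isotopologues(n_carbons: int, p13c: float) -> list[float]:
    """Probability of finding k = 0..n 13C atoms among n carbons (binomial model)."""
    return [comb(n_carbons, k) * p13c**k * (1 - p13c)**(n_carbons - k)
            for k in range(n_carbons + 1)]

# Hypothetical metabolite with six carbon atoms
natural = carbon_isotopologues(6, 0.011)  # roughly natural 13C abundance
low = carbon_isotopologues(6, 0.05)       # 5% labeled extract
high = carbon_isotopologues(6, 0.95)      # 95% labeled extract

# Relative height of the M+1 peak compared with the monoisotopic peak
print(natural[1] / natural[0], low[1] / low[0])
# The 95% pattern mirrors the 5% pattern, counted down from the fully labeled peak
print(high[::-1][:3], low[:3])
```

Under this model, the 5% extract shows a markedly enlarged M+1 peak relative to natural abundance, and the 95% extract shows the mirror image below the fully labeled peak, which is the kind of paired signature the filter criterion relies on.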
Metabolomics data acquired in separate experimental batches may exhibit variation unrelated to biological differences among sample groups. Sources of this additional analytical variance include differences in experiment times, reagents, and instrument performance. Even when using the same reagents and instruments, the performance of the high-performance liquid chromatography (HPLC) system and the mass spectrometer response may drift over time. These effects are termed ‘batch effects’ [19]. Their impact can be mitigated using batch-effect correction algorithms (BECAs) such as removeBatchEffect (RBE), a function in the limma package in R [20,21], in which the batch effect is modeled by an additional coefficient. However, BECAs can only be applied where data have been acquired in distinct batches and not where samples were measured continuously over prolonged periods of time. Furthermore, the effectiveness of BECAs is questionable in some cases, as they can be misapplied and the fitted model might introduce bias into the data [19]. Taking measures early in the workflow might produce better results than circumventing batch effects via meta-analysis. Having a reference sample such as the LTRS could help to account for changes in chromatographic conditions, instead of using more lenient thresholds in feature alignment and weak signal recovery [
22]. The implemented IS can also correct for shifts in the detector response.
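The idea of modeling a batch effect with an additional coefficient can be sketched in its simplest form as follows. This is a simplified stand-in, not the limma implementation: it assumes one feature on a log scale, no further covariates, and a purely additive offset per batch, in which case removing the fitted batch term reduces to subtracting each batch mean and restoring the grand mean. The function name and the numbers are hypothetical.

```python
from statistics import fmean

def remove_batch_offset(values: list[float], batches: list[str]) -> list[float]:
    """Subtract each batch's mean and add back the grand mean (one feature, log scale)."""
    grand_mean = fmean(values)
    batch_mean = {
        b: fmean([v for v, vb in zip(values, batches) if vb == b])
        for b in set(batches)
    }
    return [v - batch_mean[b] + grand_mean for v, b in zip(values, batches)]

# Hypothetical log peak areas of one feature measured in two batches
log_area = [10.0, 10.4, 9.8, 11.0, 11.4, 10.8]
batch = ["1", "1", "1", "2", "2", "2"]
corrected = remove_batch_offset(log_area, batch)
print(corrected)
```

Differences between samples within a batch are preserved, whereas the offset between the batch means is removed. This also shows why such a correction requires distinct batches: a gradual drift within one long run has no batch label to fit.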
Several studies [
6,
23,
24] have employed the isotopic ratio outlier analysis technique in model organisms. In such experiments, only one type of sample matrix is involved, and any features that do not exhibit the IROA isotopic pattern are indeed artefactual. Here, however, we used the labeled yeast extract, i.e., the IS of the TruQuant kit, as an internal standard to analyze human urine specimens. Thus, we are dealing with two different matrices. If a feature does not exhibit the IROA isotopic pattern, it could merely be a compound that is urine-specific or not produced in yeast in detectable quantities. With this in mind, we investigated the added value of incorporating the IROA TruQuant kit in our untargeted analysis of human urine.
We analyzed a set of 244 spot urine specimens from the German Chronic Kidney Disease (GCKD) project collected at the baseline time point [
25]. The samples were analyzed by means of a high-performance liquid chromatography (HPLC) system coupled to a time-of-flight mass spectrometer (TOFMS). The resulting data are referred to as TOF. To generate reference datasets, a subset of 56 of the urine specimens was additionally analyzed with two different methods. The first, referred to as “quant”, provided MS-based quantitative data on amino acids (AAs). The second, referred to as “NMR”, stemmed from a quantitative nuclear magnetic resonance (NMR) analysis.
2. Materials and Methods
2.2. Sample Preparation
The IROA Kit was applied to 244 urine samples from the German Chronic Kidney Disease (GCKD) project [
25], from the baseline visit. The study was executed in accordance with the Declaration of Helsinki and registered in the national registry for clinical studies (DRKS 00003971). All study procedures and protocols were approved by the ethics committees of all participating institutions (Friedrich Alexander University Erlangen–Nuremberg, the Medical Faculty of the Rheinisch–Westfälische Technische Hochschule Aachen, Charité University Medicine Berlin, the Medical Center University of Freiburg, Medizinische Hochschule Hannover, the Medical Faculty of the University of Heidelberg, Friedrich Schiller University Jena, the Medical Faculty of the Ludwig Maximilians University Munich, and the Medical Faculty of the University of Würzburg). The study was carried out in accordance with relevant guidelines and regulations. Written declarations of informed consent were obtained from all study participants before inclusion. All specimens were stored at −80 °C until preparation of samples for mass spectrometry.
Various factors affect the abundance of urinary metabolites, such as liquid and food intake and the many processes regulating the body’s solute and water content. Normalization is needed to balance out these differences among individuals and reveal the actual biological variance. Creatinine is commonly used for this purpose, as it is mainly excreted by glomerular filtration at a constant rate without reabsorption in the renal tubule [
26]. Urine was diluted with pure water to a similar creatinine concentration to correct for urinary output. Since the individual dilution of each sample is impractical, samples were grouped based on the creatinine concentration, and a fixed dilution factor was used for the respective group, see
Table 1. Creatinine concentrations determined in the course of a corresponding NMR analysis of the samples were utilized.
Samples were diluted to a volume of 20 µL. For the internal standard, five vials of IROA–IS were used. To each vial, 600 µL of pure water was added, instead of the recommended 1200 µL, to maintain an adequate concentration after dilution with urine. All five resulting IS solutions were mixed to form a homogeneous 3 mL IROA–IS solution. Then, 10 µL of IROA–IS was added to each diluted urine sample, resulting in a final volume of 30 µL. Two in-house QC urine samples were prepared similarly. QC1 and QC2 had creatinine concentrations of 19.71 mM and 8.92 mM and were diluted 1:8 and 1:4 with water, respectively, to reach a volume of 120 µL. Sixty microliters of IS was then added to each QC, which maintained the same sample:IS ratio used to prepare the samples. The final volume of 180 µL of each QC was necessary to allow multiple injections throughout the experiment.
For the LTRS, 80 µL of water was added to reach a concentration similar to that of the IS in the samples. An IS blank was prepared by adding 80 µL of water to 40 µL of IS solution. A water blank was also included.
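The volumes and dilutions stated above can be checked with a few lines of arithmetic. The sketch below uses only numbers given in the text; it assumes that "1:8" and "1:4" denote one part urine in eight or four parts total volume, which is an interpretation and not stated explicitly. All variable and function names are hypothetical.

```python
# Volumes in microliters, concentrations in mM; values taken from the text
IS_VIALS = 5
WATER_PER_VIAL_UL = 600
IS_PER_SAMPLE_UL = 10
IS_PER_QC_UL = 60
IS_BLANK_UL = 40
N_SAMPLES = 244

is_pool_ul = IS_VIALS * WATER_PER_VIAL_UL  # pooled IS solution
is_needed_ul = N_SAMPLES * IS_PER_SAMPLE_UL + 2 * IS_PER_QC_UL + IS_BLANK_UL

def diluted_concentration(stock_mm: float, total_parts: int) -> float:
    """Concentration after a 1:total_parts dilution (assumed: 1 part in total_parts)."""
    return stock_mm / total_parts

qc1_mm = diluted_concentration(19.71, 8)
qc2_mm = diluted_concentration(8.92, 4)

sample_to_is = 20 / 10  # diluted urine : IS in the study samples
qc_to_is = 120 / 60     # diluted urine : IS in the QC samples

print(is_pool_ul, is_needed_ul, qc1_mm, qc2_mm, sample_to_is == qc_to_is)
```

Under the stated assumption, both QCs end up slightly above 2 mM creatinine before the IS is added, and the pooled IS solution covers the study samples, both QCs, and the IS blank.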
For NMR analysis, samples were prepared by mixing 400 μL of human urine with 200 μL of 0.1 M phosphate buffer, pH 7.4, which contained 3.9 mM boric acid to impair bacterial growth. Furthermore, 50 μL of 0.75% (wt) trimethylsilylpropanoic acid (TSP) in deuterium oxide (D2O) and 10 μL of 81.97 mM formic acid (FA) were added as internal reference standards. NMR spectra were acquired on a Bruker 600 MHz Avance III HD NMR spectrometer (Bruker BioSpin GmbH, Rheinstetten, Germany) equipped with a helium-cooled cryoprobe and an automated sample changer. For each sample, a 1D 1H spectrum was acquired employing a Carr–Purcell–Meiboom–Gill (CPMG) pulse sequence to facilitate the suppression of macromolecular signals. From the obtained spectra, metabolites were quantified by fitting reference signals employing the Chenomx NMR suite v. 8.6 (Chenomx Inc., Edmonton, Canada).
To investigate whether the urine specimens used in the experiment contained a sufficient concentration of metabolites, we performed the recommended calibration experiment according to the protocol in the Kit’s product information sheet [
27]. The content of the internal standard (IS) vial was solubilized in 1.2 mL pure water and that of the LTRS in 40 μL pure water as suggested in the protocol. A urine specimen with a high creatinine level (20 mM) was diluted with pure water to creatinine concentrations of 0.25, 0.5, 1.25, 1.5, 2, 2.5, 3.75, 5, 7.5, 10, and 20 mM, respectively. Forty-microliter aliquots of each dilution were dried in triplicate employing an infrared vortex vacuum evaporator (CombiDancer, Hettich AG, Baech, Switzerland) and reconstituted in 40 μL IS solution.
As a result, a graph (see Figure S5) was obtained that indicates the creatinine concentration to which urine specimens should be adjusted. According to the product information sheet [
27], this concentration “
yields an overall mass spectral signal that is equal to the overall mass spectral signal of the IS. This is the amount of sample that will most accurately be measured using the IS in the future, i.e., well balanced by the standard 40 μL of IS.” In
Figure S5, this is the intersection between the lines of the normalized IS MSTUS (blue squares), the normalized
12C MSTUS (green crosses), and the line of the
12C values corrected for ion suppression (red circles). In our case, this corresponds to about 6.5 mM creatinine. We then repeated the original experiment using a subset of 26 urine specimens with original creatinine concentrations equal to or greater than 6.5 mM. Aliquots of 40 μL of urine with a creatinine concentration of 6.5 mM (either pure or prediluted to 6.5 mM) were dried and reconstituted in 40 μL IS (dissolved according to protocol in 1.2 mL) so that each sample contained a final creatinine concentration of 6.5 mM. The LTRS and IS blanks were also prepared according to the protocol. Additionally, aliquots from the same 26 urine specimens were diluted to 2 mM creatinine to obtain concentrations comparable to our original data set, and aliquots of 40 μL were also dried and reconstituted in 40 μL IS.
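Adjusting specimens to a target creatinine concentration follows the usual relation C1·V1 = C2·V2. The sketch below computes the urine and water volumes for a 40 μL aliquot. It is an illustration only: the function name is hypothetical, and the 13 mM specimen in the example is an invented value, not one of the study samples.

```python
def urine_and_water_volumes(stock_mm: float, target_mm: float,
                            final_ul: float = 40.0) -> tuple[float, float]:
    """Volumes of urine and water (µL) giving final_ul at target_mm creatinine."""
    if target_mm > stock_mm:
        raise ValueError("dilution cannot raise the creatinine concentration")
    urine_ul = final_ul * target_mm / stock_mm
    return urine_ul, final_ul - urine_ul

# Dilution series from the 20 mM specimen used in the calibration experiment
for target in [0.25, 0.5, 1.25, 1.5, 2, 2.5, 3.75, 5, 7.5, 10, 20]:
    print(target, urine_and_water_volumes(20.0, target))

# Hypothetical specimen at 13 mM adjusted to the 6.5 mM target
print(urine_and_water_volumes(13.0, 6.5))
```

The check on the target concentration reflects why only specimens at or above 6.5 mM creatinine could be included in the repeated experiment.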
3. Results
Across all samples, 423 potential IROA features were found out of 12,938 unknown features. Quality control based on accurate masses, retention time, and feature intensity (see
Section 2.6 Data processing) was applied to filter out and match the IROA
12C and fully labeled
13C signals. This left 224 features, resulting in 112 IROA pairs. Using our in-house retention time library, the identities of 27 of these 112 pairs were confirmed, see
Table S1.
To assess the added value of using the kit, the data were evaluated once using the absolute peak areas of the
12C signal, referred to as “TOF absolute”, and then again using the ratios to the respective internal standards (
12C/
13C), referred to as “TOF ratios”. Subsequently, AAs detected with the two other methods, “quant” and “NMR” (see Table S1), were compared to both “TOF absolute” and “TOF ratios”. Leucine and isoleucine were not adequately resolved chromatographically in the TOF data; hence, they were excluded, leaving 11 AAs covered by both quant and TOF absolute/TOF ratios.
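The difference between the two evaluations can be illustrated with a short sketch. It assumes an idealized case in which a loss in response affects the 12C signal and its co-eluting 13C standard by the same factor; real data deviate from this, for example through integration errors. The function name and all numbers are hypothetical.

```python
def to_ratios(area_12c: list[float], area_13c: list[float]) -> list[float]:
    """12C/13C peak-area ratio for each injection."""
    return [a12 / a13 for a12, a13 in zip(area_12c, area_13c)]

# Hypothetical feature: one sample injected three times while the response decreases
true_12c, true_13c = 8000.0, 4000.0
response = [1.0, 0.8, 0.5]

tof_absolute = [true_12c * r for r in response]  # drifts with the response
is_area = [true_13c * r for r in response]       # the IS drifts in the same way
tof_ratios = to_ratios(tof_absolute, is_area)    # the common factor cancels

print(tof_absolute)
print(tof_ratios)
```

In this idealized case the absolute areas fall with the response while the ratios stay constant, which is the benefit expected from an internal standard when the response is not stable.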
Figure 2 shows the Spearman correlation plots of four exemplary AAs ranging from high correlation (top) to low correlation (bottom), where the TOF data is represented by either TOF absolute (left) or TOF ratios (right) and compared to quant. Similar comparisons against quant were performed for all 11 mutual AAs. The corresponding Spearman correlation coefficients are given in
Table 2. Neither a substantial difference between TOF absolute and TOF ratios nor a clear trend can be seen. For a better visual comparison, the coefficients of both comparisons for all 11 AAs were plotted in a scatter plot, see
Figure 3.
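For readers who wish to reproduce such a comparison, a Spearman coefficient can be computed as the Pearson correlation of the ranks, with tied values receiving their average rank. The following is a minimal sketch in plain Python; the function names are hypothetical, and it is not the software used to produce the reported coefficients.

```python
from statistics import fmean

def average_ranks(values: list[float]) -> list[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x: list[float], y: list[float]) -> float:
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def spearman(x: list[float], y: list[float]) -> float:
    """Spearman rank correlation of two equally long lists."""
    return pearson(average_ranks(x), average_ranks(y))
```

Because only ranks enter the calculation, the coefficient is unaffected by the different scales of peak areas, ratios, and reference concentrations, as long as the relationship is monotonic.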
To analyze whether there are significant differences in the agreement of the TOF absolute or the TOF ratio method with either the quant or the NMR method, we computed, for both cases in each data set, individual differences (“diff” values) from the respective regression lines (see
Section 2.7 Data evaluation). These sets of diff values were then analyzed by means of statistical tests. In case of normally distributed diff values (
Table S2) a
t-test was used; otherwise, a paired Wilcoxon test [
31] was performed. The results shown in
Table 3 suggest no significant difference between the correlation of TOF absolute and TOF ratios against quant.
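One possible reading of this procedure is sketched below. The exact definition of the diff values is given in Section 2.7 and is not reproduced here, so the sketch rests on assumptions: diff values are taken as vertical residuals from a least-squares line, and the paired comparison is shown only as the t statistic. All function names are hypothetical.

```python
from math import sqrt
from statistics import fmean, stdev

def residuals(x: list[float], y: list[float]) -> list[float]:
    """Vertical distances of the points (x, y) from their least-squares line."""
    mx, my = fmean(x), fmean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    intercept = my - slope * mx
    return [b - (intercept + slope * a) for a, b in zip(x, y)]

def paired_t_statistic(d1: list[float], d2: list[float]) -> float:
    """t statistic of the paired differences d1 - d2 (n - 1 degrees of freedom)."""
    diffs = [a - b for a, b in zip(d1, d2)]
    return fmean(diffs) / (stdev(diffs) / sqrt(len(diffs)))
```

In use, the absolute residuals of TOF absolute against quant and of TOF ratios against quant for the same samples would be passed to the paired test, after bringing both to a comparable scale. Which scaling was applied here is an open assumption of this sketch.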
Similar comparisons were performed for six AAs that could be quantified by NMR (
Table S1).
Figure 4 shows the Spearman correlation plots for the comparisons of TOF absolute and TOF ratios to NMR for four exemplary AAs. The Spearman correlation coefficients of the six AAs are shown in
Table 4. The “diff” values were also calculated for each dataset. The results of paired Wilcoxon or
t-tests are shown in
Table 5. As with quant, no significant difference in correlation between “TOF absolute” and “TOF ratios” is seen. Furthermore,
Figure S1 shows Spearman correlation plots between “NMR” and “quant” for the six overlapping AAs.
Overall, a poor correlation was observed for aspartate and proline in all comparisons due to the low aspartate and proline concentrations in urine, see
Figure S2. The correlation with the NMR data was overall lower, which can be attributed to difficulties in NMR data analysis due to overlapping signals.
To further investigate the differences between TOF absolute and TOF ratios, the relative standard deviation (RSD) values of the QCs were considered in both datasets. Note that two urine samples served here as QCs and were analyzed repeatedly throughout the batch, resulting in 14 measurements each. The RSDs of the 112 features were calculated for TOF absolute and TOF ratios for QC1 and QC2 separately; in addition, the average RSD of both QCs was calculated, resulting in a total of six lists of RSD values.
Figure S3 shows histograms of the average RSD values for TOF absolute and TOF ratios. TOF absolute exhibits more features in the lower RSD range. Statistical analysis was performed to check whether the difference is significant. Shapiro–Wilk normality tests [
32] for all six lists of RSD values show significant
p-values (
Table S3). Thus, paired Wilcoxon tests were conducted (
Table S4). The resulting
p-values together with the data presented in
Figure S3 suggest that the RSD in TOF absolute is significantly lower than in TOF ratios. This may be caused by erroneous automatic peak integration of the internal standard peaks, which in turn leads to erroneous ratios. In general, abundant analytes with peaks well above the noise level are less challenging for automatic peak integration algorithms, and vice versa. To investigate this further, the 112 features were divided into four quartiles (Q1–Q4) based on the peak area of the respective internal standards. Q1 contains the features with the largest IS peak areas, and Q4 those with the smallest. Shapiro–Wilk and paired Wilcoxon tests between the average RSD values in each quartile were conducted, see Table S5. Q1 shows a nonsignificant Wilcoxon p-value, suggesting that features with an abundant IS signal show no significant difference between TOF absolute and TOF ratios. Although Q2 already shows a significant p-value, it is higher than those of Q3 and Q4. In addition, the three features with the highest RSD in TOF ratios were inspected individually. The first shows integration errors despite having an abundant IS; this is explained by an isomer eluting close to the IS peak (Figure S4). The second has a mismatch between the endogenous feature and its IS. The third again has a closely eluting compound interfering with the integration.
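The RSD calculation and the grouping by IS abundance described above can be sketched as follows. The sketch assumes the sample standard deviation for the RSD and equally sized groups ordered by IS peak area; neither detail is specified in the text, and the function names are hypothetical.

```python
from statistics import fmean, stdev

def rsd_percent(values: list[float]) -> float:
    """Relative standard deviation in percent (sample standard deviation)."""
    return 100.0 * stdev(values) / fmean(values)

def split_by_is_area(features: list[str], is_area: dict[str, float]) -> list[list[str]]:
    """Four groups of features: Q1 has the largest IS peak areas, Q4 the smallest."""
    ranked = sorted(features, key=lambda f: is_area[f], reverse=True)
    n = len(ranked)
    bounds = [round(q * n / 4) for q in range(5)]
    return [ranked[bounds[q]:bounds[q + 1]] for q in range(4)]

# Hypothetical example: 112 features with invented IS peak areas
names = [f"feature_{i}" for i in range(112)]
areas = {name: float(i + 1) for i, name in enumerate(names)}
q1, q2, q3, q4 = split_by_is_area(names, areas)
print(len(q1), len(q2), len(q3), len(q4))
```

For each group, the RSD lists of TOF absolute and TOF ratios would then be compared with a paired test, as done for the full set of features.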
The ratio approach was also tested for its usability for batch-effect correction. For this, a subset of 56 samples randomly drawn from the original set of 244 measurements was denoted Batch 1. The same 56 samples were measured again six months later using the same methods (Batch 2). As a comparison, the batch-effect correction algorithm removeBatchEffect (RBE) of the limma package was tested [
21].
Two feature lists were created for the comparison, each containing the data from both batches. The first is the feature list yielded by the typical fingerprinting analysis workflow, containing all detectable features meeting the analysis parameters. It comprised 3291 unidentified features and is referred to as the “T-list”. To account for RT shifts between the two batches, a higher than usual RT tolerance (0.3 min instead of 0.2 min) was used for peak alignment in the T-list. The second list was a feature list of metabolites that were identified using the IROA kit. It comprised 115 metabolites showing the IROA isotopic pattern, which were detected in both batches, and their respective internal standards. The LTRS measurements in both batches were used to identify and align the metabolites. From the second list, two further lists were generated. The list comprising the absolute peak areas of the metabolites without considering their ISs is referred to as the IROA–Abs list. The list in which the peak areas of the metabolites were divided by those of their corresponding ISs is called the IROA–ratio list. The PCA score plots of the T-list and the IROA–Abs list showed a pronounced shift between the two batches (
Figure 5a,b). This can be clearly observed by considering the clustering of the respective QC samples. The closer the clusters of the same QC, the less pronounced the batch effects, and vice versa. Leek et al. demonstrated that normalization alone often does not remove batch effects [
33]. We applied the probabilistic quotient normalization (PQN) [
34] followed by a Z–transformation for each batch separately in the T-list. This reduced the shift in the QC samples (
Figure 5c). Using the IROA–ratio list also provided a considerable improvement over the IROA–Abs list,
Figure 5d. However, applying PQN followed by RBE produced the best results, whether on the T-list or the IROA–Abs list, with data points from the two batches pairing well and QCs clustering more closely (
Figure 5e,f).
The original and new LTRS measurements (from the creatinine concentration test) were both analyzed separately by CF4 with the same parameters. The new LTRS measurements yielded 417 complete bins (uncured), compared to 232 complete bins (uncured) for the original dataset (higher dilution of the LTRS).
In the results file from CF4, three types of relevant outputs are considered: Raw, i.e., the absolute
12C peak area, corresponds to “TOF absolute”;
12C/
13C Ratios that correspond to “TOF ratios”; and Suppression Corrected, i.e., peak areas that are corrected for ion suppression, as the name states. To this end, in CF4, the
12C/
13C ratio is multiplied by the corresponding least suppressed
13C area. Spearman correlation coefficients with results obtained by a targeted MS analysis (see
Section 2 for details) were calculated for both 2 mM and 6.5 mM creatinine, employing all three outputs. The results are shown in
Table S7. For comparison, the corresponding coefficients from analyzing the same samples in MZmine are given.
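The relation between the three outputs can be written down in a few lines. The sketch below is not the CF4 implementation. It assumes that the least suppressed 13C area of a compound is the largest 13C area observed for it across the samples, which is plausible because the same amount of IS is added to every sample, but is not stated in the text. The function name and numbers are hypothetical.

```python
def suppression_corrected(area_12c: list[float], area_13c: list[float]) -> list[float]:
    """Per compound: 12C/13C ratio times the largest 13C area across the samples."""
    least_suppressed = max(area_13c)  # assumed meaning of "least suppressed"
    return [a12 / a13 * least_suppressed for a12, a13 in zip(area_12c, area_13c)]

# Hypothetical compound measured in three samples
raw_12c = [5000.0, 2000.0, 4500.0]  # corresponds to "Raw"
raw_13c = [4000.0, 2000.0, 3000.0]  # IS areas; the second sample is most suppressed
ratios = [a / b for a, b in zip(raw_12c, raw_13c)]  # corresponds to "Ratios"
corrected = suppression_corrected(raw_12c, raw_13c)

print(ratios)
print(corrected)
```

Because the corrected values are the ratios multiplied by one constant per compound, their rank order across samples, and hence any Spearman coefficient, is identical to that of the ratios.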
The ability to correct for batch effects was considered again by reanalyzing the same original two batch effect datasets using CF4. Four relevant outputs were considered, i.e., Raw, Ratios, Suppression Corrected, and Normalized (MSTUS). The resulting PCAs are shown in
Figure S6.
4. Discussion
The data show no significant improvement if TOF ratios are used instead of TOF absolute when comparing the results to reference measurements. This raises the pivotal question of whether the use of an IS in untargeted LC–MS metabolomics carries any benefit. The answer lies in the nature of the metabolic fingerprinting approach itself. Trying to be as universal as possible means forgoing the advantages of tailored methods that are designed to take full advantage of having an IS. These include manual peak integration, or the manual inspection of automatically integrated peaks, which cannot be practically applied to metabolic fingerprinting.
A complex matrix such as urine that is further mixed with a complex labeled yeast extract poses a challenge for automated data analysis. One example is the case of aspartate. In both
Figure 2 and
Figure 4, aspartate shows a better correlation to quant without the IS. When inspecting the chromatographic peaks, we notice two main problems: there are samples where the
12C aspartate peak is so low in abundance that it falls within the noise, see
Figure 6a. This explains the generally lower correlation even in “TOF absolute”. Furthermore, a peak with an
m/z of 138.0549 elutes close to aspartate. The mass difference to
13C aspartate (
m/z 138.0578) is below the 3 mDa tolerance used in data processing, hence resulting in an interference. The tentative identification of the neighboring peak suggests the
12C peak of trigonelline, a product of niacin (vitamin B3) metabolism, which is excreted in urine. The abundance of trigonelline varies across the samples.
Figure 6b–d show that the larger the peak of trigonelline, the higher the interference with the
13C aspartate coming from the IS. While using a stricter
m/z tolerance might solve this problem, it could cause problems in data processing for other compounds. This is in agreement with the work by Qiu et al. [
35], who showed the advantage of higher resolution MS instrumentation for peak detection and identification.
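The size of this interference can be put into numbers using only the m/z values and the tolerance given above. The resolving power m/Δm computed below is a rough orientation only, since the value needed in practice depends on peak shape, relative abundance, and the definition of resolution used. The variable names are hypothetical.

```python
MZ_13C_ASPARTATE = 138.0578  # labeled aspartate from the IS
MZ_INTERFERENT = 138.0549    # neighboring peak, tentatively 12C trigonelline
TOLERANCE_MDA = 3.0          # m/z tolerance used in data processing

delta_mda = abs(MZ_13C_ASPARTATE - MZ_INTERFERENT) * 1000
delta_ppm = delta_mda / 1000 / MZ_13C_ASPARTATE * 1e6
resolving_power = MZ_13C_ASPARTATE / (delta_mda / 1000)
within_tolerance = delta_mda <= TOLERANCE_MDA

print(round(delta_mda, 1), round(delta_ppm, 1), round(resolving_power), within_tolerance)
```

The two signals differ by about 2.9 mDa, or roughly 21 ppm, so they fall within the same 3 mDa window, and separating them by mass alone would call for a resolving power on the order of several tens of thousands.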
Furthermore, employing a labeled cell extract as an internal standard means that low-abundance metabolites will have low signal intensities for the respective labeled compounds, resulting in a higher variance or difficulties in peak integration.
In general, using a yeast extract for urine analysis is not optimal. However, we did not address other aspects of using the kit here, such as the range of covered metabolites, which would be readily affected by the matrix type. Instead, we focused on a few identified metabolites and investigated the use of a complex IS and its ability to improve the correlation of the data with a reference dataset. The lack of improvement when using ratios instead of absolute peak areas can also be explained by the stability of the analysis. We did not observe any shifts or deterioration of the HPLC–TOFMS performance when analyzing the 244 samples. This is exemplified by the PCA scores plot shown in
Figure 7a. The QC samples, two urine samples analyzed 14 times each, cluster closely together.
Regarding the batch effects, the application of the kit helped to reduce them. Nevertheless, the IROA–ratio list might show even better results when comparing several batches measured over a prolonged time or even measured on different instruments. Additionally, since RBE works when there are distinct batches, IROA ratios would be more beneficial if samples are measured in a single batch over a long period of time and a gradual decline in instrument performance occurs throughout the batch. This, however, is beyond the scope of this study.