**Metabolomic Variability of Di**ff**erent Genotypes of Cashew by LC-Ms and Correlation with Near-Infrared Spectroscopy as a Tool for Fast Phenotyping**

**Elenilson Alves Filho 1, Lorena Mara Silva 2, Ynayara Lima 3, Paulo Ribeiro 2, Ebenézer Silva 2, Guilherme Zocolo 2, Kirley Canuto 2, Selene Morais 3, Ana Cecília Castro <sup>2</sup> and Edy de Brito 2,\***


Received: 6 May 2019; Accepted: 20 June 2019; Published: 25 June 2019

**Abstract:** The objective of the present work was to develop an advanced fast phenotyping tool to explore the cashew apple compositions from different genotypes, based on a portable near-infrared (MicroNIR) spectroscopy. This will be in addition to associating the variability of the respective cashew apple pulps with the genotypes by ultra-performance liquid chromatography (UPLC), coupled with high-resolution mass spectrometry (HRMS). The NIR analysis is a non-destructive, low-cost procedure that provides prompt results, while considering the morphology of different cashew apples (shape, size, and color). The UPLC-HRMS analysis is characterized by specific bioactive compounds, such as the derivatives of hydroxybutanoic acid, galloyl, and flavonoids. Furthermore, both techniques allowed the identification of a group of accessions, which presented similarities among the chemical profiling. However, to improve the understanding of cashew chemical and physical variability, further variables related to the cashew apple composition, such as edaphoclimatic conditions, should be considered for future studies. These approaches lead to the conclusion that these two tools are useful for the maintenance of BAG-Caju (Cashew Germplasm Bank) and for the cashew-breeding program.

**Keywords:** Anacardium occidentale; fast phenotyping; NIR; UPLC-HRMS; chemometrics

#### **1. Introduction**

Cashew (*Anacardium occidentale* L.) is a fruit tree originating from the north of South America, with greater dispersion in the coastal regions, from the state of Rio de Janeiro to the Amazon, where it is possible to find spontaneous populations with great genetic and, consequently, phenotypic variability. To conserve part of this variability, Embrapa maintains a germplasm bank (BAG-Caju) holding almost seven hundred accessions with important variability [1].

Nuts are true fruits with hard shells that are inedible due to the presence of cardol and cardanol (caustic substances). The cashew nut consists of nut kernels that are edible and commercially known as cashew nuts. This complete fruit forms on the distal end of the false fruit that is also edible (called cashew apple or peduncle) [2]. Cashew nut (kernel) is the main product of this crop marketed globally. However, Brazil, by tradition and culture, has the habit of consuming the cashew apple (false fruit) in different ways, e.g., as fresh fruit, juice, or sweets. Thus, BAG-Caju that was initially established based on the size and weight of the cashew nut began to include the variability related to the false fruit, such as vitamin C, sugars, acidity, astringency, color, and bioactive compounds [3].

Due to the elevated number of cashew apple accessions, we have investigated the applicable potential of various technologies to develop simple, precise, and low-cost methods for studying the physico-chemical and nutritional features of this fruit. Among the technologies, the near infrared (NIR) spectroscopy is considered to be an economical, fast, and efficient technique for the evaluation of foodstuffs [4–6]. The NIR analysis produces an electromagnetic spectrum (reflectance or transmittance) between 780 nm and 2,500 nm, where the wavelength depends on the scattering and absorption processes according to the chemical compositions: molecular functional groups, C-H, N-H, S-H, or O-H bonds. A MicroNIR spectrometer is a portable NIR device for real-time, in situ, and non-destructive chemical and physical analyses [7].

In particular, the cashew apple exhibits beneficial characteristics to human health, such as antitumor, antimicrobial, urease inhibitory, and lipoxygenase activities [3,8–10]. The genetic enhancement of the quality and attributes of the cashew apple is still under development in Brazil, although the knowledge of genetic diversity for the construction of germplasm banks exists. Most cashew breeding programs are based on traditional selection approaches, including the size and weight of nuts or yield of a cashew tree, and efforts have also been applied in the prospection of dwarf genotypes with enhanced fruit quality and resistance to pests and diseases [11]. Therefore, the identification of the characteristics with economic interest, related to conserved or cultivated accessions, may be appropriated by breeding programs aimed at launching high-performance cultivars to meet specific demands, such as cultivars with high levels of bioactive compounds. The analysis of unique chemical fingerprints, based on ultra-performance liquid chromatography (UPLC), coupled with high-resolution mass spectrometry (HRMS) has been successfully applied for the study of fruit pulps. In such foodstuff studies, a large number of samples have to be screened for identifying metabolites and their classification according to the geographic origin, climate conditions, genotype, and cultural practices using multivariate statistics [3,12–15]. Usually, multivariate analyses are applied, in order to explore complex matrices, such as foodstuff to determine the variations and relationships among the compositions of the samples [16,17]. This untargeted approach is advantageous when the compounds and degradation products are not known.

From the foregoing studies, the aim of the present study was to evaluate the potential of a MicroNIR spectrometer to explore the composition of different genotypes of intact cashew apple from the Embrapa germplasm bank, as well as to create multivariate regression models considering ◦Brix, total acidity, and concentrations of ascorbic acid (vitamin C). Furthermore, due to the need to investigate plants that produce fruits with numerous compounds beneficial to human health, the study of metabolite profiling (nutritional and functional compounds) to reveal the variability of pulps from different genotypes of cashew apple was also investigated.

#### **2. Results**

The results were divided according to the analytical technique applied to evaluate the cashew apple variability: Section 2.1 for MicroNIR and Section 2.2 for UPLC-HRMS.

#### *2.1. Exploratory Multivariate Analysis of the MicroNIR Dataset*

The NIR analysis contains undesirable sources of error (intrinsic imperfections) for multivariate studies, and therefore, pre-treatments can correct many systematic or random errors. The information from the cashew apple composition was partly obscured (overlapped), and the distribution of the variables was highly skewed by usual and expected spectral imperfections. Three pre-treatments of the MicroNIR dataset were tested to assess the features regarding the cashew apple composition, minimizing irrelevant variations within the spectra to obtain quality data: Multiplicative scatter correction (MSC), standard normal variate (SNV), and first derivatives, using Savitzky–Golay filter with a second order polynomial for five points (Figure S1 in Supporting Information). Therefore, the pre-processing by MSC was chosen as an input for the following chemometric analysis.

To comprehend and classify the variability of the organic compounds in intact cashew apples according to the genotype, via detecting the tendencies related to the composition, a method based on hierarchical clustering was developed to segregate the samples in groups according to the composition similarity. Figure 1 presents a 3D dendrogram in heat map form: Genotypes with the harvest year in columns; wavelength in rows, and the signals intensities illustrated in colors. The relatively deep red represents the high relative intensity (wavelength); the relatively deep blue, the low relative intensity; and white, the intermediate intensity. Important tendencies for four cluster formations were observed in the heat map based on genotype dissimilarity (Figure S3 in Supporting Information). Cluster 4 presented the most dissimilar genotypes (zero similarity). The results reflected the natural differences among the groups, which were formed by different spectroscopic information from cashew genotypes pulp composition.

**Figure 1.** Three-dimensional (3D) dendrogram (sample × wavelength from MicroNIR × intensity) representing the chemical composition similarity relationships among the genotypes.

Multivariate Regression Analysis of the MicroNIR Dataset

Prior to performing the partial least squares (PLS) analysis, the ◦Brix value, total acidity, and concentrations of ascorbic acid (vitamin C) were determined in 31 cashew apples randomly chosen from the set of accessions. Three regression models were created for each dependent variable ( ◦Brix, total acidity, and ascorbic acid) with spectroscopic data to evaluate the association between the chemical variability, genotype and NIR profile. Sequential to these supervised regression modeling, the PLS by intervals, known as iPLS (interval PLS), was developed to maximize the covariance between the independent variables on the MicroNIR dataset (X matrix) and each dependent variable. This method realizes individuals PLS models for each pre-defined spectra intervals, optimizing the predictive capacity of the model, while assisting the interpretation by reducing the number of variables, thereby providing superior prediction capacity using all the variables [18]. Figure 2 illustrates the relevant intervals (highlighted in green color) for regression modeling, based on RMSECV using the Brix values (a) and total acidities (b). The modeling using the concentrations of ascorbic acid was weakly adjusted based on statistical parameters (data not shown).

**Figure 2.** Spectral average (black line) of the MicroNIR spectra from different genotypes of cashew. The relevant absorbance selected by the iPLS model (green) for the genotype discrimination based on (**a**) ◦Brix and (**b**) total acidity.

The merit graph obtained by regression modeling, using ◦Brix and total acidities, were evaluated to assess the quality of the calibration models, which are illustrated in Supplementary Information (Figures S4 and S5). The Hotelling's T2 <sup>×</sup> Q residuals graph indicates that the sample did not negatively influence the modeling [19]. The leverage plot revealed the influence of each genotype on models based on Hotelling's T2, and the studentized residuals (mean zero and unit variance) indicated the lack of fit of some quantitative parameters [20]. However, despite the high leverage and high residuals, the respective genotypes expressed a very low studentized Y residual, and therefore, the regression model was still able to sufficiently predict the ◦Brix and total acidity based on selected MicroNIR spectra (green region in Figure 2). The robustness of both prediction models (◦Brix and total acidity) was achieved by the proximity between both regression curves from calibration and cross-validation (red and green lines). Furthermore, the statistical parameters used to assess the modeling quality (Table 1) indicated a well-adjusted models for both ◦Brix and total acidity, according to the high total variance cumulated using three latent variables (LVs) (higher than 94%), low bias model, low calibration and cross-validation errors, and proximity between the calibration and cross-validation errors. Additionally, the root mean square error of calibration (RMSEC) and root mean square error of cross-validation (RMSECV) ratio, close to 0.75, is indicative of a well-adjusted model [21,22].

**Table 1.** Statistical parameters obtained by multivariate regression modeling of MicroNIR spectra with ◦Brix and total acidity using 3 LV.


<sup>1</sup> Percentage variance captured by the regression model; <sup>2</sup> Coefficient of correlation between the real and predicted values during the calibration; <sup>3</sup> Root mean square error of calibration; <sup>4</sup> Coefficient of correlation between the real and predicted values during the cross-validation; <sup>5</sup> Root mean square error of cross-validation; <sup>6</sup> Similarity criterion; <sup>7</sup> Average difference between the estimator and real values during the calibration; <sup>8</sup> Average difference between the estimator and real values during the cross-calibration.

#### *2.2. UPLC-HRMS*

Commonly, in UPLC-HRMS analysis, some organic compounds may preferentially ionize in the positive or negative ionization mode, as phenolic and carboxylic derivatives ionized well in negative ionization mode, while flavonoids and alkaloids ionize better in positive ionization mode. Therefore, the cashew apple pulps were analyzed under a negative ionization mode to screen the aforementioned compounds and differentiate each cashew apple pulp according to the genotypes. Due to the complexity of the chromatographic data, visual differentiation of the sample composition could not be achieved. Therefore, non-targeted multivariate analyses by hierarchical cluster analysis (HCA) and principal component analysis (PCA) were applied to comprehend the variability of the secondary metabolites in cashew apple pulps according to the genotype.

Initially, the unsupervised method HCA was applied: Segregating the pulps in groups according to similarity. Important tendencies for four cluster formations were observed in the dendrogram based on the genotype at a similarity index of 0.362 (Figure S3 in Supporting information). Cluster 4 presented the most distant samples included in the study, with zero similarity (as also observed in Figure 1). The results reflected the natural differences among the groups, which were formed as a function of pulp composition, dependent on the cashew genotype. In addition, the PCA method was applied to assist the modeling and interpretation of the multivariate data, with the scores graph presented in Figure 3a, and the respective loadings (plotted in lines) in Figure 3b. The tentatively identified biomarkers are presented in Table S1 (Supporting information). The main separation tendencies of the samples were observed with respect to PC1 and PC2 axes with a total explained the variance of 36.7%. The clusters observed in HCA were important for identifying the resultant grouping in the scores graph. The pulps were assigned symbols according to the clustering tendency: Blue triangles for negative scores of PC1 and PC2; red squares for positive scores of PC1 and negative scores of PC2; black stars for positive scores of PC1 and PC2; and green circles for positive scores of PC2. The samples with no relevant result were symbolized by gray circles considering the unrepresentative replicates according to the year.

**Figure 3.** Principal component analysis (PCA) multivariate analysis of UPLC-HRMS data from cashew apple pulps of different genotypes: a) PC1 × PC2 scores coordinate system for the cashew apple pulps from different genotypes; b) respective loadings plotted in lines form. The samples were assigned symbols according to the clustering tendency: blue triangles for negative values of PC1 and PC2; red squares for positive values of PC1 and negative values of PC2; black stars for positive values of PC1 and PC2; green circles for positive values of PC2. The samples that did not present relevant results were symbolized by gray circles.

Compounds with significant changes, based on genotypes according to the chemometric evaluation, and not exhibiting overlapping signals, were integrated (details in Experimental Section 4.2.2), and their variations were expressed as the relative contribution. The relative peaks areas were calculated for quantitative expression of the chemical properties, with the differences evaluated by ANOVA single factor. Figure 4 illustrates the relative contributions of the total ion abundance of the peaks in the pulps, since the relative amplitude of the peaks provided the relative population of the isotopic forms in the chromatograms. The results from the signal area of the base peak intensity (BPI) corroborated the PCA results, considering the deviation of the method from the three replicates of sampling during two years, which totaled 6 pulps for each genotype, with the hydroxybutanoic acid ethyl ester-hexoside (at 3.28 min) being responsible for the pulps clustering at the red group, and galloylhexose I (at 1.60 min) and digalloylhexoside I (at 2.82 min) for the pulps clustering in the blue group.

**Figure 4.** Relative contributions of the isotopic forms in chromatograms (UPLC-HRMS) for the compounds at retention times of 1.60, 2.82, 3.23, 3.82, and 4.25 min.

#### 2.2.1. Multivariate Classification Analysis of the UPLC-HRMS Dataset

Based on the non-targeted chemometrics results and the relative quantification, the classification modeling by partial least squares-discriminant analysis (PLS-DA) was employed to improve the association of the chemical variability of the pulps according to the cashew genotype, which is illustrated in Figure 5. The model presented a classification ability of 88.05% using 3 LVs, considering the year replicate. The statistical parameters used to assess the quality of the modeling (Table 2) indicated a well-adjusted classification, with an RMSEC/RMSECV ratio close to 0.75 (similar values) [22]. The low calibration and cross-validation errors expressed a suitable predictive performance of the model estimated as a function of the global error, samples leverage, and the sample residual X-variance. The low error values indicated the similarity between the pulps used for prediction and those used to make the calibration model.

**Table 2.** Parameters from partial least squares-discriminant analysis (PLS-DA) classification model of UPLC-HRMS data from cashew apple pulps of different genotypes.


<sup>1</sup> Total variance percent in X matrix refer to 3 LVs; <sup>2</sup> Coefficient of correlation between the real and predicted groups during the calibration; <sup>3</sup> Root mean square error of calibration; <sup>4</sup> Coefficient of correlation between the real and predicted groups during the cross-validation; <sup>5</sup> Root mean square error of cross-validation; <sup>6</sup> Similarity criterion.

**Figure 5.** PLS-DA classification model of UPLC-HRMS data from cashew apple pulps of different genotypes: (**a**) LV1 × LV2 × LV3 scores 3D plot; (**b**) respective loadings potted in lines form (b) from PLS-DA for classification of the cashew apple pulps. Legend: 1-galloylhexose I (1.60 min); 2-digalloylhexoside I (2.82 min); 3-hydroxybutanoic acid ethyl ester-hexoside (3.28 min); 4-myricetin-3-*O*-glucoside (3.83 min); and 5-myricetin-3-*O*-rhamnoside (4.25 min).

#### **3. Discussion**

It is known that NIR spectroscopy is a non-destructive, low-cost, and non-invasive procedure providing prompt results for a sample composition: Any molecule containing C-H, N-H, S-H, or O-H bonds. However, due to the intrinsic overlapping of the signals, pre-processing is mandatory. The MSC and SNV algorithms presented similar effects, which were considered as exchangeable (Figure S1 in Supporting Information). After the application of the MSC algorithm, the spectra were adjusted to present the same scatter level estimated by a mean spectrum. Conversely, the SNV algorithm treated each spectrum separately by absorbance autoscaling [23]. The derivative algorithm emphasized steep peaks and enhanced the overlapping peaks, as well as reduced the measurement variations, thereby improving the differentiation of the bands. The first and/or second derivatives are more commonly applied, and the second derivative NIR spectrum may result in sharp signals [24,25]. Therefore, the MSC algorithm was chosen for the development of the experiments.

According to the HCA-heat map (Figure 1), important tendencies for four main cluster formations were observed based on the genotype, even with differences in fruit morphology, such as shape, size, and peel color. Another disadvantage of NIR analysis was the penetration of radiation into the tissues of fruits, which skin may reduce the light penetration that decreases with the depth [26]. The results reflected the natural differences among the cashew apple groups formed as a function of the composition intrinsically related to the genotype. Cluster 4 presented the most dissimilar samples compared to those of the other groups. The main absorbance peaks, related to the clustering, were located between 1150 and 1340 nm, 1370 and 1850 nm, and 1900 and 2020 nm. The absorbance from 2040 to 2170 nm was characterized as indicative of non-relevant functional groups. The absorbance from 1400 to 1620 nm and 1900 to 2020 nm were characterized as the second and first overtone of the OH stretch, respectively. The absorbance from 1150 to 1340 nm and 1650 to 1850 nm were due to the C-H stretches related to the second, and first overtone, respectively, which may be from carbohydrates and other organic compounds present in the cashew apple skin and/or pulp. The band at 1340 nm is attributed to the CH group from cellulose [27]. Additionally, it was highlighted that the O-H group from monomeric organic acids presented the first overtone at 1445 nm; and a characteristic overtone around 1890 nm indicative of the O-H stretching combined with C-O stretching from organic acids [26,28]. However, it is known that all these absorption bands are close to the stronger water absorption regions, hindering the signals observation [29].

The iPLS evaluation highlighted the region between 1250 and 1400 nm as the most important for fruit discrimination, based on the calibration modeling by ◦Brix values and total acidity (statistical parameters, described in Table 1. The modeling for ascorbic acid was weakly adjusted based on statistical parameters. In particular, the absorbance range between 1570 and 1650 nm was important for the ◦Brix model, while the absorbance between 1800 and 1900 nm was important for the total acidity. The absorbance between 1650 and 1850 nm may be related to the C-H stretches from the second, and first overtone, respectively (Figure 1), which may be from carbohydrates and other organic species in cashew apple skin and/or pulp; and between 1800–1900 nm may be related to carbonyl and carboxyl groups from carboxylic acids [27]. The regression modeling ◦Brix values (a) and total acidities (b) were well adjusted based on statistical parameters, and the modeling for ascorbic acid was weakly adjusted. This may be because the MicroNIR was operated between 1150–2170 nm, and some organic acids found in fruits typically show bands from O-H group related to the second and third overtones approximately at 1000, and 800 nm, respectively, as well as starch and sugars to the second (920 nm) and the third (720 nm) overtones of O-H stretching, and the third (910 nm) and the fourth (750 nm) overtones of C-H stretching [28].

The relatively low correlation coefficients (r2) for both models (◦Brix values and total acidity) express the rather weak calibration performances estimated as a function of the global model error, samples leverage, and the sample residual X-variance. Therefore, the cross-validation results indicated that further parameters related to the fruits morphology (shape, size, and color), non-homogeneous distribution of particles into the fruits (density variations), and environmental factors that may affect the instrument performance as the illumination (since all the analysis was developed during the year), must be taken into account for improving the comprehension of the chemical and physical variability of the cashew apples from the germplasm bank of Embrapa. For instance, small physical variations from sample to sample may lead to light scattering that influences the MicroNIR measurement, resulting in baseline shifts and scaling variations (intensity variations), and consequently, disturbing the future predictions evaluation.

Prior to the chemometric analysis of the HPLC-HRMS analysis, a valuable feature of chromatograms acquisition was taken into account, since the retention times of the chromatographic peaks are sensitive to minor fluctuations in temperature, pH, flow, pump operation, etc. To solve the problem of small peaks shifts related to the same compound inter-chromatograms, some different peak alignment methods were tested. The alignment practice can be performed manually, using COW (correlation optimized warping) [30], or using the bucketing method, which reduces the chromatogram dimensionally by slicing it into equal sized regions [31], making it easier to analyze the respective loadings. Therefore, all the chromatogram peaks were previously aligned using the COW method, which is illustrated in Figure S2 in Supporting Information.

Significant composition variability was detected in PCA regarding cashew genotypes despite the low cumulated variance, which indicated the existence of further factors beyond the scope of this study, such as seasonality. An examination of the loadings provided evidence of the variables (compounds) responsible for the separations or clustering observed in scores. The signals from galloylhexose I (at 1.60 min, 331.0650 *m*/*z*), digalloylhexoside I (at 2.82 min, 483.0741 *m*/*z*), hydroxybutanoic acid ethyl ester-hexoside (at 3.28 min, 293.1242 *m*/*z*), myricetin-3-*O*-glucoside (at 3.83 min, 479.0826 *m*/*z*), myricetin-3-*O*-rhamnoside (at 4.25 min, 463.0875 *m*/*z*), and a mixture of unknown compounds (at 4.82 min) were responsible for the pulps placement based on genotypes. Minor compounds were irrelevant due to the pretreatments of mean centering applied over the samples [16]. According to Figure 3, the negative loadings of PC1 are the pulps symbolized by blue triangles (2001/3, 2001/6, 2005/122, B 963, and B967) with relatively high amounts of galloylhexose I and digalloylhexoside I. This latter compound is a trihydroxybenzoic acid derivative, which provides astringent flavor and can contribute to the characteristic bitter taste of immature cashew apple [32]. The pulps symbolized by red squares (CP 06, CP 09, B 393, and BRS 275) and black stars (2005/127, 2005/133, 2005/102, BRS 226, CP 76, 98/101) exhibited tendencies of having relatively high amounts of hydroxybutanoic acid ethyl ester-hexoside at a retention time of 3.28 min according to positive loadings of PC1. Hydroxybutanoic acid ethyl ester-hexoside has been previously described in melon fruit and it is considered a precursor of volatile compounds [33]. The presence of this compound has been associated with some amino acids, such as alanine, glutamine, isoleucine, phenylalanine, tryptophan, and tyrosine. Alanine has been reported as one of the key amino acids of the characteristic profile of cashew apple. Phenylalanine and tyrosine have also been detected but in small amounts [34]. The positive loadings of PC2 exhibited the tendency of the cashew apple pulps symbolized by green circle (2005/111, 2005/223, 2001/13, 98/116, and B 741) and black stars (2005/127, 2005/133, 2005/102, BRS 226, CP 76, 98/101) to have relatively high amounts of the flavonoids, myricetin-3-*O*-glucoside, myricetin-3-*O*-rhamnoside, and an unknown mixture of compounds between the retention times of 4.79 and 4.85 min. The presence of myricetin-derived and other flavonoids compounds in pulps may offer biological benefits, including the reduction of cardiovascular disease and risks of cancer [3], as well as antihyperglycemic property [35]. Furthermore, these compounds have been reported in methanol-water extracts of cashew apple [3]. The corroborative results between quantitative and chemometric analyses confirm the advantages of an untargeted multivariate analysis, since the compounds and degradation products are not always known, and it is sometimes difficult to find certified standards.

According to Figure 5, the LV1 axis was most relevant in clustering the cashew apple pulps in black color (2005/102, 2005/111, 2005/127, 2005/133, 2005/223, 2001/13, 98/101, 98/116, BRS 266, B 741, CP 76), and in separating them from those in red (CP 06, CP 09, B 393) and blue (2001/3, 2001/6, 2005/122, B 963, B 967). In addition, the LV2 axis was important in clustering the red samples at the positive scores, and LV3 was relevant in the separation of the red and blue cashew apple pulps. The interpretation of the loadings revealed that the pulps in black had higher amounts of galloylhexose I and digalloylhexoside I than those of in red and blue. The results corroborated that the pulps in red color have a higher amount of hydroxybutanoic acid ethyl ester-hexoside than those in the black and blue pulps; in addition, opposite behavior between the hydroxybutanoic acid ethyl ester-hexoside and galloylhexose I was presented. Finally, the LV3 axis confirmed the relatively high

amount of the flavonoids, myricetin-3-*O*-glucoside and myricetin-3-*O*-rhamnoside, in the pulps in blue and presented the opposite behavior between these flavonoids and hydroxybutanoic acid ethyl ester-hexoside. Therefore, the statistics data indicated that the model could be acceptable to classify new or unknown cashew apple pulps based on the main secondary metabolites.

#### **4. Materials and Methods**

#### *4.1. Sampling and Experimental Planning*

Based on experimental viability, 764 cashew apples (*Anacardium occidentale*, L.) from 24 different accessions were randomly harvested at the Embrapa experimental station (Pacajus, Ceará, Brazil) coordinates 4◦11 07"S; 38◦30 07"W; altitude: 70 m. The region has a tropical climate, average temperatures of 26 to 28 ◦C, and 1020 mm average annual rainfall. The soil is classified as Ultisol and has a sandy/medium texture with low organic matter content. The cashew samplings were collected between August and December in two years. Table 3 presents the different genotypes with their respective accession numbers and morphoagronomic characteristics.

**Table 3.** Illustration of the cashew genotypes associated with the accession number and morphoagronomic characteristics: plant size, tree appearance, fruits color and shape, and origin (county-state).



**Table 3.** *Cont*.


**Table 3.** *Cont*.


#### **Table 3.** *Cont*.

\* Legends: crop means registered product on market; breeding program means plant before crossbreed; germplasm means plant collected and conserved in the germplasm bank.

#### *4.2. Portable NIR Spectrometer Analysis*

The portable NIR (MicroNIR) analysis of the cashew apple composition was divided into two stages. First, all the 764 intact fruits were analyzed by MicroNIR. After that, 31 fruits were randomly selected for the determination of quantitative parameters, such as ◦Brix, total acidity, and concentration of ascorbic acid (vitamin C) to develop multivariate regression models.

The NIR experiment was acquired using a portable NIR spectrometer (MicroNIR 1700, Viavi, Milpitas, CA, USA), which operated in a range between 1150 and 2170 nm (spectral resolution of 10 nm,), with dimensions of 45 mm diameter × 42 mm high, two tungsten sources for reflectance measurements, and a continuous monochromator based on a linear variable filter. The parameters for

spectral data acquisition were set at 50 ms integration time and an average of 100 scans. The reference spectrum for the absorbance calculation was obtained from a piece of Spectralon™, and the dark signal was obtained by pointing the measurement window of the instrument to the ambient environment.

#### 4.2.1. Determination of ◦Brix, Total Acidity, and Concentration of Ascorbic Acid

The ◦Brix, total acidity, and concentration of ascorbic acid (vitamin C) were determined in 31 cashew apples randomly chosen from the set of accessions. These experiments were carried out to use the quantitative results as categorical variables (Y column) to develop multivariate regression models by partial least square (PLS) analyses by maximizing the covariance between X matrix (NIR spectral data) and Y responses (◦Brix, total acidity, and ascorbic acid as dependent variables).

The ◦Brix (concentration of sucrose w/w) was determined following the AOAC method (2005) [36] (soluble solids content), which was obtained by refractometry using a digital refractometer (ATAGO™ N1, Kirkland-WA-USA) with automatic temperature compensation.

The total acidity was determined as follows: Total titratable acidity (TTA) determined by titration with NaOH solution (0.1 N) in 1 g of the pulp diluted to approximately 50 mL of distilled water, containing 3 drops of 1% phenolphthalein until pink coloration, was observed. The results were expressed as the percentage of malic acid according to IAL (1985) [36].

The concentrations of ascorbic acid were determined by titration with 0.02% 2,6-dichloro-indophenol (DFI) as reported by Strohecker and Henning (1967) [37]. One gram of pulp was diluted to 100 mL with 0.5% oxalic acid and homogenized. Subsequently, 5 mL of this solution was diluted to 50 mL with distilled water and titrated. The results were expressed as mg·100 g−<sup>1</sup> FW (fresh weight).

#### 4.2.2. Chemometric Analysis of the MicroNIR Dataset

Different multivariate approaches were performed on the numerical matrix from 764 cashew apple fruits (Section 2.1). The averaging method was applied on those fruits from the same accession (from four to six replicates), resulting in 135 mean spectra. The spectral region between 1150 and 2170 nm was used for the modeling, and a matrix with the dimensionality of 16,875 data points (135 spectra × 125 variables into each spectrum) was generated. The samples were named according to the accession number and year of harvest.

For numerical matrix construction, each spectrum was converted to an American Standard Code for Information Interchange (ASCII) file and imported by the Origin™ program (version 9.4). To reduce the dimensionality of the original data and to assist the interpretation of the multivariate dataset, the matrix was averaged along variables by a factor of 2 using the PLS-Toolbox™ program (version 8.6.2, Eigenvector Research Incorporated, Manson, WA USA), and imported by GENE-E program (https://software.broadinstitute.org/GENE-E/index.html) for pattern recognition through the hierarchical clustering algorithm by heat map. The Euclidean distance was used to measure the proximity between the samples (columns), and the average linkage method (sum-of-squares approach in calculating intercluster distances) was applied. The results were presented as heat maps (three-dimensional (3D) dendrogram = sample × wavelength × intensity) [22].

In addition to the unsupervised analysis, supervised methods by PLS were developed using the ◦Brix values, total acidity, and concentrations of ascorbic acid previously calculated in Section 2.2.1 to improve the identification of chemical changes according to genotypes by MicroNIR. The simplified PLS (SIMPLS) algorithm was applied to build the models and the LVs were selected in accordance with the statistical parameters based on the full cross-validation method: RMSEC, RMSEV, and calibration and cross-validation coefficients (r2) [16].

#### *4.3. UPLC-HRMS Analysis*

Due to the health benefits of the cashew apple, additional experiments were developed in parallel by non-targeted UPLC-HRMS analysis. A total of 24 different genotypes of cashew apple (as described in Section 2.1) were evaluated. The cashew apples were manually pressed to obtain the resultant pulp, followed by centrifuged for 5 min at 804.6 g (IEC clinical centrifuge, Damon/IEC Division, Needham, MA, USA). The samples were preserved at −80 ◦C until the analysis.

Prior to the UPLC-HRMS analysis, the samples were filtered using PTFE membranes of 0.22 μm. The analysis was performed on an Acquity system (Waters) coupled with quadrupole/TOF (Waters) equipped with an ESI source operated in the positive ion mode. The chromatographic separation was performed using Waters Acquity UPLC BEH (150.0 × 2.1 mm, 1.7 μm) column with the temperature set at 40 ◦C. Water and acetonitrile were used for the mobile phase, both with 0.1% formic acid. The gradient ranged from 2% to 95% of water in 15 min in a flow of 0.4 mL·min−<sup>1</sup> and injection volume of 5.0 μL per sample. The desolvation gas was N2. The desolvation temperature was set at 350 ◦C at a flow rate of 350 L·h−<sup>1</sup> and a source temperature of 120 ◦C. The capillary voltage was set to 3200 V. The collision energies/cone voltages were set at 6 eV/15 V (low) and 30–50 eV/30 V (high) to achieve sufficient fragmentation. Data were collected using the negative ionization mode between 100 Da and 1180 Da, and the mode tandem was MSE.

#### 4.3.1. Chemometric Analysis of the HPLC-HRMS Dataset

Chemometric analysis was performed on the numerical matrix from 24 cashew apples harvested in duplicates this year, and analytical triplicate, resulting in 144 chromatograms. The chromatographic region between 0.65 and 7.12 min was selected. The samples were named according to the accession numbers (Table 3).

The same procedure applied for numerical matrix construction from MicroNIR (Section 4.2.2) dataset was also applied for the UPLC-HRMS dataset. Therefore the chromatograms were converted to an American Standard Code for Information Interchange (ASCII) file and import by the Origin™ program (version 9.4) in order to build the matrix. The final matrix was exported for chemometric analyses by HCA, PCA, and PLS-DA using PLS Toolbox™ program (version 8.6.2, Eigenvector Research Incorporated, Manson, WA, USA).

The normalized scaling parameter and baseline correction, using linear fit algorithms, were applied over the variables, and mean-centered processing was applied over the samples, which reduced the noise and minor analytical errors [38,39]. For HCA, the matrix was mean-centered, and the incremental linkage method (sum-of-squares approach in calculating the inter-cluster distances) was applied. The Euclidian distance was used for distance metric. The PCA was performed using singular value decomposition (SVD) algorithm. To improve the identification of the chemical constituents associated with cashew genotype, a supervised method by PLS-DA was employed using the SIMPLS algorithm. The number of LVs were selected in accordance with the statistical parameters: RMSEC; RMSECV; calibration and cross-validation coefficients (r2); and similarity criterion RMSEC/RMSECV ratio above 0.75 [16,21].

#### 4.3.2. Relative Contribution

The peaks detected as exactly as possible, in both m/z and retention time, were used for determining the peak area for achieving the relative contribution of the compounds with less overlapped signals in the chromatograms. The relative contribution of the areas was calculated based on the total ion abundance from the peaks in the samples, since the relative amplitude of the peaks provides the relative abundance of the isotopic forms in the chromatograms. Therefore, the normalized means in the base peak intensity (BPI) at the retention times of 1.60, 2.82, 3.28, 3.83, and 4.25 min were determined.

The results were evaluated using the analysis of variance ANOVA single factor (significance level of 0.05; means comparison using Tukey test; Levene's test for the homogeneity of the variance) to statistically certify the differences among the relative contributions. The deviation of the method was estimated based on the null hypothesis (*p*-value) from the three replicates of sampling for two years, totaling 6 samples for each cashew genotype.

#### **5. Conclusions**

It was demonstrated that MicroNIR spectroscopic analyses of the cashew apple composition provided a non-destructive and low-cost method for obtaining prompt results. In addition, important composition tendencies were observed with four fruits clusterings according to their composition similarity, and genotype, even considering the morphologic differences, including shape, size, and color. The multivariate regression results, obtained using ◦Brix and total acidity, showed that it is possible to satisfactorily predict ◦Brix and total acidity within the cashew genotypes. However, the parameters related to the fruit composition, and the environmental factors that affect the instrument performance must be taken into account to improve the comprehension of the chemical and physical variability of the cashew apples from the germplasm bank of Embrapa.

Additionally, the chemometrics evaluation of the UPLC-HRMS dataset was suitable to follow changes in the composition of cashew apple pulps according to genotype. The current study resulted in the identification of relatively high amounts of different bioactive compounds, including galloylhexose I, digalloylhexoside I, hydroxybutanoic acid ethyl ester-hexoside, and the flavonoids, myricetin-3-*O*-glucoside and myricetin-3-*O*-rhamnoside, in different genotypes. This information is useful for breeding programs to establish accessions with higher concentrations of important compounds for human health.

**Supplementary Materials:** The following are available online at http://www.mdpi.com/2218-1989/9/7/121/s1. Figure S1: (a) Raw absorbance spectra from 135 cashew apples obtained using the portable NIR spectrometer, and the same spectra after the following treatments, (b) MSC, (c) SNV, and (d) first derivative using the Savitzky–Golay filter with a second order polynomial for five points. Figure S2: The total ion chromatograms from 24 different genotypes of cashew apple pulps acquired under negative ionization mode. Figure S3: Dendrogram representing the chemical composition similarity relationships among the cashew apple pulps. Figure S4: Regression modeling using the MicroNIR dataset from different genotypes of cashew apples based on ◦Brix values: a) influence plot of Hotelling's T<sup>2</sup> <sup>×</sup> Q residuals, (b) leverage <sup>×</sup> studentized residuals, c) Y calibration × cross-validated Y with 95% confidence limits, and d) scores on LV1 × LV2. Figure S5: Regression modeling using the MicroNIR dataset from different genotypes of cashew apples based on total acidity: a) influence plot of Hotelling's T2 <sup>×</sup> Q residuals, (b) leverage <sup>×</sup> studentized residuals, c) Y calibration <sup>×</sup> cross-validated Y with 95% confidence limits, and d) scores on LV1 × LV2.

**Author Contributions:** All authors provided critical feedback and helped shape the research, analysis, and manuscript writing. Particularly, the researchers E.d.B. (Chemist) and E.S. (Agronomist) conceived the presented idea, theory development, and investigated the study viability. The researchers S.M. (Chemist), G.Z. (Chemist) and K.C. (Chemist) helped to plan and supervise the project. The researchers A.C.C. (Biologist) and E.S. (Agronomist) planned and designed the agronomical experiments. The authors L.M.S. (Chemist) and P.R. (Chemist) verified and developed the analytical methods. The author Y.L. (graduate) helped in chemical experiments and contributed to sample preparation. The author E.A.F. (Chemist) developed the chemometric experiments and wrote the manuscript together with L.M.S. under the E.d.B supervising.

**Funding:** This research was funded by CNP*q*, grant number 314737/2018-9; and FUNCAP, grant number DCR-0024-01686.01.00/15.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Comparative Metabolomics and Molecular Phylogenetics of Melon (***Cucumis melo***, Cucurbitaceae) Biodiversity**

**Annick Moing 1, J. William Allwood 2, Asaph Aharoni 3, John Baker 4, Michael H. Beale 4, Shifra Ben-Dor 3, Benoît Biais 1, Federico Brigante 5,6,7, Yosef Burger 8, Catherine Deborde 1, Alexander Erban 5, Adi Faigenboim 8, Amit Gur 9, Royston Goodacre 10, Thomas H. Hansen 11, Daniel Jacob 1, Nurit Katzir 9, Joachim Kopka 5, Efraim Lewinsohn 9, Mickael Maucourt 1, Sagit Meir 3, Sonia Miller 4, Roland Mumm 12, Elad Oren 9, Harry S. Paris 9, Ilana Rogachev 3, Dominique Rolin 1, Uzi Saar 9, Jan K. Schjoerring 11, Yaakov Tadmor 9, Galil Tzuri 9, Ric C.H. de Vos 12, Jane L. Ward 4, Elena Yeselson 8, Robert D. Hall 12,13,**† **and Arthur A. Scha**ff**er 8,\*,**†


Received: 27 February 2020; Accepted: 20 March 2020; Published: 24 March 2020

**Abstract:** The broad variability of *Cucumis melo* (melon, Cucurbitaceae) presents a challenge to conventional classification and organization within the species. To shed further light on the infraspecific relationships within *C. melo*, we compared genotypic and metabolomic similarities among 44 accessions representative of most of the cultivar-groups. Genotyping-by-sequencing (GBS) provided over 20,000 single-nucleotide polymorphisms (SNPs). Metabolomics data of the mature fruit flesh and rind provided over 80,000 metabolomic and elemental features via an orchestra of six complementary metabolomic platforms. These technologies probed polar, semi-polar, and non-polar metabolite fractions as well as a set of mineral elements and included both flavor- and taste-relevant volatile and non-volatile metabolites. Together these results enabled an estimate of "metabolomic/elemental distance" and its correlation with the genetic GBS distance of melon accessions. This study indicates that extensive and non-targeted metabolomics/elemental characterization produced classifications that strongly, but not completely, reflect the current and extensive genetic classification. Certain melon Groups, such as Inodorous, clustered in parallel with the genetic classifications while other genome to metabolome/element associations proved less clear. We suggest that the combined genomic, metabolic, and element data reflect the extensive sexual compatibility among melon accessions and the breeding history that has, for example, targeted metabolic quality traits, such as taste and flavor.

**Keywords:** genetic resources; melon; genotype by sequencing; elemental analysis; metabolome; *Cucumis melo*

#### **1. Introduction**

*Cucumis melo* L., melon, is a phenotypically highly variable species with respect to fruit characteristics [1]. Melon fruits vary not only in size and shape but also in the accumulation of various metabolites, the most obvious of which include the horticulturally important metabolites of external and internal pigmentation, volatiles responsible for fruit aroma, and carbohydrates and organic acids accounting for sweetness and acidity [2]. Melon fruit range in sizes up to 20 kg; in shape from spherical to very long; in taste from insipid to sweet, acidic, or bitter; have external colors of green, yellow, orange, and red, with internal flesh colors of white, green, orange, or cream; and encompass a broad range, from highly aromatic to almost non-aromatic types. The species is unique in that it has representatives of both climacteric and non-climacteric ripening physiology [3], which further impacts on the metabolite components of the ripe fruit.

The extreme fruit variation within *Cucumis melo* does not easily lend itself to conventional infraspecific classification. Kirkbride [4] suggested that the species should be considered to consist of two subspecies, based on the pubescence of ovaries and young fruits. Accordingly, ovaries and young fruits having appressed short hairs were assigned to *C. melo* subspecies *agrestis* and those having pilose or lanate, spreading, long hairs were assigned to *C. melo* subsp. *melo*. However, it is often not easy to reconcile young fruit pubescence with the melon fruits seen in marketplaces in various parts of the world. Although markets in some regions feature young cucumber-like melons, most markets feature mature ripe sweet dessert melons, others ripe and highly aromatic but insipid "duda'im" melons, and yet others fully grown but unripe "snap" melons [1,5].

Attempts to classify melons according to fruit characteristics date to the first half of the 19th century, and over the past two centuries, a great number of proposed infraspecific classifications for melons have accumulated (reviewed in [6]). Among these proposed classifications are "lumpers", which have not indicated enough divisions within the species to adequately reflect its wide variation, and "splitters", which have fragmented the species into an unwieldy number of units. To be feasible, any useful infraspecific classification must at once be reflective of genetic relationships and also be easily observed by those for whom the classification is intended, which should be a wide international audience that includes scientists and non-scientists alike. For melons, this means that the characteristics of fruits should serve as the basis for any such classification. Given all of the above considerations, our view is that the infraspecific breakdown to 16 varietas first proposed by Pitrat et al. [6] or cultivar-groups

("Groups") (adopted by Burger et al., [2]) is currently the most useful. Groups are lower ranking than the subspecies, but each Group can consist of more than one market type.

The economically most important Groups are the Reticulatus (climacteric, netted), Cantalupensis (climacteric, non-netted), Inodorus (non-climacteric), and Ameri (climacteric, dryland cultivation, Asian), all of which have sweet flesh when ripe and belong to subsp. *melo*. Other Groups include the Flexuosus (snake melon), Duda'im (pocket melon), Momordica (snap melon), and Khandalak, and the East Asian Conomon (Oriental pickling), Makuwa, and Chinensis Groups. However, even the placement of the various Groups into the two subspecies is not fully agreed upon, as may be expected from a somewhat subjective classification based on the single trait of ovary pubescence. The recent attempts at making order within the melon species, at the subspecies level, literally focused on hair-splitting. For example, while the *agrestis* subspecies proposed by Pitrat et al. [6] encompasses five Groups (Conomon, Makuwa, Chinensis, Momordica, and Acidulus), the *agrestis* as described by Decker-Walters et al. [7] combines Momordica, Conomon, Duda'im, and Chito. The latter two Groups, Duda'im and Chito, were placed in subsp. *melo* by Pitrat et al. [6], and Zhao et al. [8] recently traced the Momordica Group to subsp. *melo*. Furthermore, the distinction between Reticulatus (climacteric, netted) and Cantalupensis (climacteric, non-netted) based on netting is particularly unclear since the netting phenotype is expressed as a continuum of what is basically a quantitative trait. Thus, semi-netted Charentais-type melons have vacillated between the two infraspecific Groups [6,9,10].

The difficulty of clear infraspecific classification of the species is further confounded by the ease of crossbreeding within the species, as well as the undocumented history, both early and more recent, of most accessions, and the widespread occurrence of feral melons. Even the natural range of melon species has been the subject of ongoing studies. Traditionally, melon has been grouped with other *Cucumis* spp. associated with African origin [4]. However, recent studies indicate a closer relationship to an Australian/Asian clade, suggesting that *C. melo* has origins in Asia [11,12]. Inodorous Group melons have a Central Asian origin, traceable to the mid-9th century and had arrived in Spain by the 11th century [13]; Duda'im Group melons have been traced to mid-9th century Persia [14]; and Flexuosus Group melons have been traced back 4000 years to ancient Egypt [15,16].

Molecular-based phylogenetics contributed to the infraspecific classification of *C. melo*, consistently supporting a subspecies separation, with small differences in the placement of some of the Groups [17,18]. Recently, Endl et al. [19] pointed to the complexity within the *agrestis* subspecies and showed that there is likely a complex African/Asian/Australian origin of the *agrestis* types of *C. melo*, based on polymorphisms in seven DNA regions. The most recent and encompassing phylogenetic classification of the species was reported by Zhao et al. [8], which corroborated the subspecies classification but placed the Indian Momordica Group within the *melo* subspecies. Interestingly, Leida et al. [20], based on ca. 200 single nucleotide polymorphisms (SNPs), identified what they refer to as Spanish and European Inodorus Group landraces within the *agrestis* subspecies, distinct from the well-known cultivated Inodorus Group, suggesting that the non-climacteric genetic trait may have evolved independently multiple times.

Metabolomic-based phylogeny, or chemosystematics, has regularly been attempted from the early part of the last century [21], and was especially popular in the 1970s. However, these studies focused on intra-family classification at the species level and were generally based on the measurement of individual components of single biochemical families, frequently alkaloids, as allowed by the technologies of the period. For example, the chemical systematics of the family Rutaceae and the order Rutales attracted much research attention. This chemosystematic classification has been compared to the phylogeny determined by molecular polymorphism strategies and found to be generally confirmatory [22]. However, chemical systematics has not been attempted at the infraspecific level for the obvious reason that a limited number of chemicals are unlikely to be adequate for describing the extent of varietal differences. Furthermore, relatively minor differences due to few, or even single, genes may still be causal of large differences in metabolic components. For example, the genetic polymorphism controlling ripening physiology in melons, whether climacteric or non-climacteric, although essentially a simply inherited trait [3], will likely be accompanied by major changes in the primary and specialized metabolomes.

While small subsets of metabolites may not be useful for infraspecific systematics and phylogenetics, the number of metabolites available that can be analyzed by large-scale metabolomic strategies has potential discriminatory power. The advances in metabolomics technologies now allow for these techniques to be used and to attempt to analyze infraspecific relationships [23]. Recently, we described a metabolomic comparison of two melon cultivars [24], and the results were encouraging regarding the relevance of such analyses to a larger infraspecific classification. Furthermore, we performed a targeted metabolomic characterization of 77 metabolites, primarily of quality-determining volatiles, sugars, and carotenoids, of a novel segregating genetic population derived from random introgressions between two distantly related melon accessions [25]. The results indicated the potential of metabolomic analyses of genetic variability for the discovery of associations between metabolites and metabolic pathways. Esteras et al. [26] also recently surveyed the volatile and carotenoid profiles of broad genetic variability of melon and emphasized the value of unidentified and untapped diversity for melon fruit quality improvement.

Considering the presence of hundreds of thousands of primary and specialized metabolites in the plant kingdom, no single metabolomics platform is capable of describing the plant metabolome; however, the combination of multiple platforms can approach that goal. In this study, the metabolite profile of 51 melon accessions, representing the broad genetic variability in the species, was characterized in depth using an orchestra of metabolomic strategies. The metabolomic technologies used comprised flow injection electrospray mass spectrometry (FIE-MS) fingerprinting of semi-polar extracts, untargeted proton nuclear magnetic resonance (1H-NMR) profiling of polar or semi-polar extracts, liquid chromatography coupled to QTOF (quadrupole time-of-flight) mass spectrometry (LC-QTOF-MS) of semi-polar compounds, gas chromatography coupled to mass spectrometry (GC-MS) of polar extracts, solid-phase microextraction (SPME) GC-MS of volatiles, and inductively coupled plasma mass spectrometry (ICP-MS) of mineral elements [27]. This broad array of technologies collected over 80,000 metabolite signatures, i.e., molecular features.

The analyses detailed in this report are based on the comparison of the infraspecific classification as determined by metabolomic profiling of an extensive collection of accessions representing most of the cultivar-groups of *Cucumis melo*, with the classification derived from extensive comparative genotyping. The unique combination of >80,000 specific molecular features and >20,000 genetic polymorphisms obtained in this study allowed us to compare the inferred infraspecific phylogenies derived from each of the strategies and to test the relationships between metabolomic and genetic distance in this complex widely cultivated species.

#### **2. Results**

#### *2.1. Infraspecific Structure of Cucumis melo*

Fifty-one accessions of *Cucumis melo*, representing the cultivar-groups Reticulatus, Cantalupensis, Inodorus, Ameri, Flexuosus, Duda'im, Momordica, Khandalak, Conomon, Makuwa, and Chinensis, were grown in Israel during a spring-summer season and flesh and rind samples of mature fruit were harvested and distributed to the collaborating laboratories for the respective metabolomics analyses (Figure 1, Table 1; Table S1). In our classification into subspecies and cultivar-groups, we followed the proposed classifications of Pitrat et al. [6] and Burger et al. [2], with the exception of the placement of the single Momordica accession PI414723, which we included among the subspecies *melo*.

In parallel, 44 accessions could be genotyped by sequencing (GBS), which, in brief, is based on high-throughput sequencing of restriction enzyme fragments of sample DNA. For the present study, GBS provided 23,931 informative SNPs, which were selected for genome-wide analyses and phylogenetic classification. The accessions not included in the GBS analysis were provisionally genotyped by direct sequencing of over 11,000 bps, derived from 20 PCR reactions, yielding 116 informative SNPs (Figure S1).

**Figure 1.** Fruits of the accessions used for this study, representing most of the cultivar-groups of *Cucumis melo* (listed in Table 1 and Table S1): Cantalupensis, Reticulatus, Inodorus, Ameri, Flexuosus, Dudaim, Momordica, Khandalak, Conomon, Chinensis, and Makuwa.

**Table 1.** List of melon accessions used in this study. All accessions were analyzed for metabolites. The seven accessions not included in the GBS analysis are noted as NI. Further details regarding the accessions, including classification of climacteric behavior, are presented in Table S1.



**Table 1.** *Cont*.

Based on 23,931 genetic polymorphisms, the 44 accessions could be classified into two well-defined clusters (referred to as I and II) clearly distinguishing between the subspecies *agrestis* and *melo* (Figure 2). A more distant accession designated Qishu Meshullash (QME), either *Cucumis trigonus* or *C. colossus*, both of which have been included in *agrestis* by Endl et al. [19], was included in the GBS analysis as an outlier. The smaller cluster I consists entirely of the subsp. *agrestis* accessions of the Chinensis, Makuwa, and Conomon Groups. Within the Conomon Group, accessions FRC and TOG are weakly separated from the accessions of the Chinensis and Makuwa Groups, which are interspersed among themselves. The Momordica and Duda'im accessions are distinct from the *agrestis* subspecies but are also separated from the rest of cluster II, which contains accessions of subsp. *melo* and consists of a number of sub-clusters (Figure 2). Ten accessions of the Inodorus Group form a sub-cluster IIa1, which is distinct

from, but closely related to, sub-cluster IIa2 comprising the Kirkagac-type Inodorus accessions and the two Ananas-type Ameri Group accessions. The Flexuosus Group accessions (IIa3) also Group together and are most closely related to the Inodorus and Ameri accessions. An additional cluster (IIb) includes both the green-fleshed Ha'Ogen-type Cantalupensis accessions (IIb1) and the closely related, yet distinct, green-fleshed Galia-type Reticulatus (IIb2). A third cluster IIc consists of two sub-clusters of orange-fleshed cultivars, one of which consists of the five orange-fleshed American accessions of the Reticulatus Group (IIc1) and, the other of four European accessions of the orange-fleshed Charentais-type Cantalupensis Group. The most unexpected placement was that of BEL, which we assumed to be a Charentais-type Cantalupensis melon. However, it was genetically distinct from the other subspecies melo types in clades IIa,b,c.

**Figure 2.** Phylogram representing the phylogenetic relationships of the 44 accessions of *C. melo* used for GBS analysis in this study. The feral QME serves as outlier. Bootstrap values are based on 100 iterations.

The seven accessions, which were solely characterized metabolomically, are PI2015801, STA, PI435288, BSK, PMR, PI149169, and CRE, and these were classified based on 116 informative SNPs as shown in Figure S1 and Table S2. PI2015801 is a Kirkagac type of Inodorus and the PCR-based sequencing placed it near the Kirkagac-type PI334107. CRE is a unique orange-fleshed Inodorus accession and was genotypically closer to the two Kirkagac Inodorus accessions than to the more typical Inodorus accessions, HDG and PSR. BSK was classed together with the two other Conomon accessions, FRC and OHG. PI435288 has the appearance of a characteristic long-fruited Flexuosus type (Figure 1) and was genotypically classed as being closely related to the FAQ variety. PMR was most similar to the American netted Reticulatus accessions, both in appearance and based on the genetic polymorphisms. STA and PI149169 classed closely together, and related to PI435288 and FAQ.

#### *2.2. Combined Analysis of All Metabolomic and Elemental Data*

We analyzed fruit flesh and rind samples of the 51 melon accessions (Table 1). An independent duplicate of one accession, namely subsp. *melo* Duda'im (DUD), was included for control purposes, i.e., for internal validation of profile similarity.

A total of >80,000 molecular features were collected from the combination of nine analytical profiling strategies applied to the samples of melon flesh and rind, yielding more than 36,000 and 46,000 features for the flesh and rind, respectively (Table 2). The vast majority of features were collected by the two non-targeted MS-TOF techniques, UPLC profiling of semi-polar metabolites, and GC profiling of a polar fraction enriched for primary metabolites, including taste-relevant sugars and organic acids. For the metabolic classification purpose of this study, these molecular features remained non-annotated. However, for the selection of molecular features that were relevant to build a classification model of melon accessions (refer to Section 2.4.), we used a subset of annotated features [27]. In addition, three global variables of fruit quality were included, percent dry matter (%DM), total soluble solids

(TSS), and pH, with mean values of 11.5 %DM, 9.7 ◦Brix, and pH 5.9, These showed considerable variability among the accessions as illustrated by their coefficients of variation of 10.4%, 26.2%, and 11.0%, respectively (*n* = 52, Table S1, Figure S2).

**Table 2.** Summary of the metabolome and elemental data measured using MS or NMR analytical strategies in extracts of fruit flesh or rind samples from of 52 melon accessions.


We combined the data sets of the profiled molecular features (Table 2) after 0.1–0.9 quantile range normalization of each of the sub-datasets, separately. Unique features of melon accessions were given high weight in our analyses through missing value substitution by zero. We combined all normalized sub-datasets into a single matrix of melon accessions characterized by all available molecular features of both flesh and rind samples.

Preliminary analyses by principal components analyses (PCA) and independent components analyses (ICA) verified the complex nature of the combined dataset and revealed that only 23.3% of the total variance was represented by the first three principal components. ICA of the first three principal components (Figure 3A,B) relaxed the orthogonality criterion of PCA and instead optimized independent component kurtosis as a measure of bi- or multi-modality of the scores' values of the resulting ICA axes [28]. The ICA scores plot array indicated separate clusters of subsp. *agrestis* and subsp. *melo*, with full or partial separation of respective accession Groups.

To transform the multi-dimensional ICA clusters of melon accessions into a mono-dimensional tree structure comparable to conventional genetic distance representations, e.g., by phylogenic or genetic distance trees, we created a covariance matrix of the complete accession profiles. For this purpose, we did not preselect features and did not apply dimension reduction methods to consider the complete molecular variance reflected by the data set. The covariance matrix was subsequently subjected to hierarchical cluster analysis (HCA) and bootstrapped to generate an HCA support tree (Figure 3C), with the node confidence validated by the bootstrap values. These values ranged from 2 to as high as 92 on a scale of 0–100. High bootstrap support was given to the terminal nodes of similar melon accessions with two of the Duda´im profiles giving an estimate of a bootstrap threshold of 82 that was indicative of the level of metabolic similarity of genetically identical melon accessions. The basal nodes of the HCA tree, however, received low bootstrap support. These results indicated that alternative basal classifications may be present in the data set.

**Figure 3.** Metabolic classification of *Cucumis melo* accessions using the combined dataset of melon flesh and rind analyzed by multiplexed metabolome and elemental profiling. (**A**,**B**) ICA of the first three PCs obtained from the complete data set of >80,000 molecular features. The first three PCs comprised 23.3% of the total variance of the data set. (**C**) Covariance matrix of the complete molecular profiles of *C. melo* accessions, with HCA of the matrix using covariance distance metrics and complete linkage for clustering. HCA included support analysis by bootstrapping with 1000 iterations. Bootstrap support values of all HCA nodes and a scale of node height are included. Boxed regions of the covariance matrix indicate examples of additional alternative metabolic associations described in detail in the text. Cluster cones were arbitrarily set at the 75% tree height.

This interpretation was supported by alternative highly correlated groups of melon accessions that became apparent in the covariance matrix (Figure 3C). Multiple instances of alternative associations were apparent. One example was provided by the accessions of the Cantalupensis Group. This Group split between two HCA sub-branches, but alternative associations of the two Cantalupensis splits were also detected. In one HCA branch, the Cantalupensis accessions OGE, NOY, and PH406 clustered with the Reticulatus accessions KRY and MAK and the Ameri accessions AYO and EDO. In the second HCA branch, the remaining Cantalupensis accessions clustered with Inodorus accession CRE, Khandalak INB, and Reticulatus BES.

We concluded that the reduction of all paired metabolic similarities into a tree structure perhaps only reflects in part the complex metabolic similarities that were generated by the melon breeding process. This breeding process may have included crossbreeding beyond the Group or subspecies boundaries within *Cucumis melo*. Such breeding events may cause genetic introgressions with transfers of large gene sets. In addition, breeding for metabolic quality traits may have caused partial convergence of metabolic profiles from genetically distant accessions. We found a matrix representation of accession similarity more adequate in representing a complex intra-species breeding history. Consequently, we took this approach to also compare and correlate metabolic to genetic distance matrices instead of analyzing tree similarities only (Figures S3 and S4).

The highly supported nodes of our current HCA tree (Figure 3C) were useful to evaluate the closest metabolic neighbors among the GBS-characterized and the seven PCR-characterized accessions (Table 1). The Conomon cultivar BSK (Black Skin) of subsp. *agrestis* grouped at a medium bootstrap support of 44 with TOG, a Conomon Group accession of subsp. *agrestis*. A higher-level metabolic cluster with a bootstrap support of 32 contained all five Flexuosus accessions of subsp. *melo*, including non-characterized accession STA, PI435288, and accession PI149169 of undecided Group affiliation. A well-supported cluster consistently contained 13 Inodorus accessions. This cluster included both Kirkagac-type accessions, PI201581 and PI334107, and their closest neighbor, GOB. As reported above, non-characterized CRE, provisionally classified as Inodorus, grouped to a high-level cluster of diverse subsp. *melo* accessions and was metabolically most similar to INB of the Khandalak Group. Finally, PMR membership to the Reticulatus Group was confirmed with nearest neighbor TPM (Figure 3C). PMR and TPM had high similarity to a highly supported (bootstrap value 68) cluster containing three other Reticulatus accessions.

#### *2.3. Platform-Specific Metabolomic Analyses of Melon Flesh or Rind Tissue*

For each analytical strategy, an HCA of accession samples was separately performed, based on the range-normalized levels of all features detected per strategy (Table 2). This HCA was performed for both the flesh (eight technological platforms, Figure S3) and rind (six technological platforms, Figure S4) of the fruits, and the similarity of the grouping of the 52 accession samples based on the observed variation in their metabolome or element composition was compared to the results based on the variation in their genome (Figure 2, Table S3).

For flesh samples (Figure S3), the HCA showed a partial or complete clustering, depending on the analytical strategy, of accessions from the Inodorus Group. Likewise, the accessions of the Flexuosus Group were closely clustered, except for data from the NMR profiling and volatiles SPME GC-MS platforms. Grouping of Cantaloupe accessions was also clear, with the four analytical strategies detecting specialized metabolites (semi-polar extracts and volatiles) as well as for 1H-NMR fingerprints of polar extracts and the mineral element platform. Overall, grouping was less clear for the three platforms, detecting mainly primary metabolites (polar extracts). The grouping of accessions based on metabolites detected by SPME-GCMS and especially LC-QTOF-MS mostly showed a relatively high correlation with the genetic distances between accessions, while grouping based on primary metabolites targeted by NMR profiling data showed a relatively lower correlation. For rind samples (Figure S4), the HCA showed a partial or complete clustering of accessions from the Inodorous Group for all six platforms, except for LC-QTOF MS of semi-polar extracts for which distances within and between clusters were similar. Grouping of Cantaloupe accessions was also clear with GC-MS of polar extracts, except for one accession.

For the 44 accessions having both GBS and metabolome or elemental data, the association between the genetic distance matrix and each metabolomic or elemental distance matrix was measured using a Mantel test, separately for flesh and rind data (Table 3). Among the 14 associations between genetic and metabolomic/elemental distances that could be tested, namely 8 for flesh, and 6 for rind (volatiles and microelements were not measured in rind), 13 including the flesh microelement-based matrix, showed a significant association with a *p*-value <0.0001. Only the metabolomic distance matrix based on LC-QTOF-MS of semi-polar extracts in rind was not related to genetic distances. For flesh, the highest two correlations between the genetic and a metabolic/elemental distance matrix were observed for specialized metabolites, i.e., for LC-QTOF-MS of semi-polar extracts (*r* = 0.56) and for SPME GC-MS of volatiles (*r* = 0.39). For rind, the highest two correlations between the genetic and metabolic/elemental distance matrix were observed for the 1H-NMR fingerprints of semi-polar extracts (*r* = 0.56) and for FIE-MS of semi-polar extracts (*r* = 0.47).


*Metabolites* **2020** , *10*, 121

#### *2.4. Feature Selection by Random Forest Technology*

To select molecular features that were relevant for the metabolic classification of melon accessions, we applied random forest (RF) machine learning technology. RF technology tuned towards metabolic feature selection enables the selection of small sets of molecular features that, if manually supervised, can be relevant for sample classification [29]. For this purpose, we used the set of 605 provisionally annotated molecular features from the combined >80,000 data set (Table S4). To create approximately balanced classes, we split the available melon accessions into six subsets, namely 1) all accessions of subsp. *agrestis*, 2) the Cantalupensis accessions, 3) the Flexuosus accessions, 4) the Inodorus accessions, 5) the Reticulatus accessions of subsp. *melo*, respectively, and 6) a class that contained all other melon Groups that were represented by only one or two accessions (Figure 4A). Because of the diverse nature of the sixth class, we did not expect good classification results, but we added this class to contrast with the remaining more populated classes that had 5–14 members (Figure 4).

**Figure 4.** Random forest (RF) analysis (**A**) and decision tree classification (**B**) of six *C. melo* accession classes using a subset of 605 provisionally annotated molecular features. Classification of six pre-defined melon accession classes was performed. The classification table (A) lists classes, class size, the actual and predicted class membership, and the classification error (means of 10 iterations using hyperparameter-tuned RF settings). The decision tree uses the top 20 most informative molecular features ranked by the mean decrease in accuracy. The node information of the decision tree reports the used molecular feature code (Table S4) and threshold value. The branch information (colored ovals) lists the main class, the fraction of classified samples, left to right, subsp. *agrestis*, Cantalupensis, Flexuosus, Inodorus, miscellaneous, and Reticulatus accessions of subsp. *melo*. The percentage value indicates the fraction of the 52 accession samples that fall into each of the diagnostic categories.

Hyperparameter tuning of the RF procedure was applied to the six melon classes and based on the annotated data subset, provided the parameter settings: Sampled fraction of 0.899 for RF-classification, minimal node size of 2, and mtry value of 59. The estimate of the overall error rate across 10 repeated RF classifications was 23.9% ± 2.9 %, mean ± standard deviation (SD). The classification error differed, however, between tested classes. Subsp. *agrestis* classified best, with an average classification error of 0.02, followed by subsp. *melo* Inodorus with an error of 0.05. Subsp. *melo* Cantalupensis and subsp. *melo* Reticulatus classified with errors of 0.17 and 0.19, respectively. Unexpectedly, Flexuosus accessions were difficult to classify using the annotated metabolites. Molecular features relevant for the classification of the Flexuosus Group may be present among the non-annotated features (Figure 3B,C). Using annotated molecular features, the Flexuosus classification failed and had a high classification error, similar to the class of miscellaneous accessions (Figure 4A).

The annotated mass features were ranked by RF analysis according to the variable selection parameter mean decrease in accuracy (Table S4). A decision tree based on the top 20 ranks of annotated mass features used only three features (Figure 4B). This decision tree had four classes but failed to define a diagnostic rule for the Flexuosus and miscellaneous classes. One decision rule defined a class that only contained subsp. *agrestis* accessions to 100%. A second class defined Cantalupensis accessions to 89% with a minor contribution of subsp. *agrestis* accessions (11%). A third class contained 87% Inodorus accessions with 6% each of subsp. *agrestis* and Flexuosus. The fourth class was, with 42%, enriched for Reticulatus accessions but contained, in addition, 21% Flexuosus and 37% miscellaneous accessions. The top 20 informative molecular features according to RF analysis were glucose, formic acid, glycerol, arabitol, galactinol, and raffinose in rind; xylose and 1-kestose in flesh; 10 volatile organic compounds detected in melon flesh; and Cr and As elements detected in melon flesh (Table S4). These features provide additional supporting information for melon classification.

#### **3. Discussion**

#### *3.1. Phylogenomic and Phytochemical Relationships Partly Coincide in C. melo*

Besides contributing to the genetic classification of the species, the goal of this research was to determine whether phylogenomic and phytochemical relationships in melon coincide and are mutually supportive. Studies of this nature have been carried out for over 50 years but have been limited in two significant respects. Firstly, due to the limiting nature of the targeted chemical analyses performed in earlier studies, chemosystematics was limited to interspecific classification where biochemical differences were large. Thus, interspecific chemosystematics, in conjunction with genomic classification were successfully implemented to distinguish, for example, among *Capsicum* (pepper) species [30], *Brassica* species [31], and *Citrus* species [32]. However, large-scale infraspecific classification based on chemosystematics has not been successful to date. Secondly, the comprehensiveness of metabolomic analyses, as is presented here, was made possible only recently, and earlier studies attempting chemosystematics focused on only single metabolite families, such as volatiles and carotenoids [26]. This inevitably not only limited the breadth of the characterization but also could cause strong bias in the results, especially if the metabolites most impacting the clustering have also been of selective value, either through natural selection or human selection via breeding. The variability of select metabolite families is generally controlled by few genes, which are strongly affected by selection pressure. Hence, for crop varieties, targeted analyses of metabolite variation may not reflect the genetic distance. For example, the impact of a trait, such as non-climacteric ripening, although determined by few genes, is expected to have wide-ranging pleiotropic effects on numerous secondary metabolites, including volatiles [33].

Our study, which is based on the combination of large-scale unbiased genetic and similarly broad characterization of metabolomic phenotypes, allowed us to determine the degree of coincidence between the phylogenetic and chemosystematic classification of the *C. melo* species. We expected to have improved representation of phylogenetic/genomic relationships in our data set by including

non-targeted metabolomic technologies, as compared to a multi-targeted approach that may cover predominantly those metabolic traits that were directly subject or linked to the selective breeding process. Human breeding in melon largely aims for metabolic traits and therefore, breeding can lead to convergence of such traits in cultivars of different genotypic backgrounds and evolutionary histories. We hypothesized that non-targeted metabolites, such as those detectable by the UPLC-TOF-MS and GC-TOF-MS technologies, were presumably of less selective value during domestication and breeding than the targeted metabolites of volatiles, sugar, and organic acid levels, which together comprise the major determinants of quality and hence selection. We therefore predicted that our broad non-targeted profiling would cover the potentially convergent breeding traits, but these would be diluted by metabolic and elemental traits that were not targeted by breeding and hence hypothesized to have higher discriminatory power. Thus, chemosystematics based on non-targeted metabolites would be more likely to reflect true genetic relationships.

Systemization of melon variation goes back at least as far as the 1832 monograph of Jacquin [34]. These early classifications and nearly all subsequent ones were based primarily on fruit traits. The introduction of molecular genetic technologies has helped greatly in evaluating prospective infraspecific relationships. For example, the infraspecific classification proposed for the extremely variable *Cucurbita pepo* L. (Cucurbitaceae) that was based mainly on fruit shape [35] was supported and clarified by polymorphisms in 134 simple sequence repeat loci [36]. Rapid development of new technologies has allowed for successively and increasingly more comprehensive and precise identification and analyses of genetic variability. Most recently, high throughput sequencing and massive GBS strategies have been applied to melon variability in representative GWAS populations [8,37], yielding 17,000 and 22,000 genetic polymorphisms, respectively. Such large-scale genotyping allows for a more precise assessment of genetic relationships within species. The genotype classification of 44 of the 51 accessions used in this study was derived from the GWAS GBS previously reported by Gur et al. [37]. The remaining accessions not included in that GWAS study but screened metabolomically in this study could be reliably placed within Groups based on an additional smaller set of polymorphisms identified by direct PCR product sequencing.

Our results present a mixed picture; on the one hand the metabolomics-based infraspecific classification indeed largely reflects the phylogenetic classification. The correlation between genetic and metabolomics distance increases with the addition of metabolite signals, as expected, and the >80,000 combined metabolite signals best mirror the genetic classification (Figure 3). However, on the other hand, there are exceptions to a straightforward correlation, as will be discussed here further, and the metabolomics-based classification at times may better reflect the market-type classification rather than the genotype classification. This may indicate that selection and breeding for particular fruit characteristics may have had a disproportional impact on metabolites for which breeding did not directly select.

The present phylogenetic results substantiate the dichotomy within *Cucumis melo* to subsp. *agrestis* and subsp. *melo* (Figure 2). Furthermore, the results suggest that only the East Asian cultivar-groupGroups, Chinensis, Makuwa, and Conomon, belong to subsp. *agrestis*. The Duda'im and Momordica accessions that were included in our study are allied with subsp. *melo* but form an outlying cluster within that subspecies, as recently suggested [8,17,19]. However, with respect to the comparison of phylogenetic and the combined metabolomics-based classification (Figures 2 and 3), the *agrestis* subspecies does not behave as a single Group. The two Chinensis and five Makuwa types of the *agrestis* subspecies are similar to each other based on their metabolite composition and clearly clade together. However, they are distinct metabolomically from the other *agrestis* subspecies Group Conomon. Instead, the three Conomon accessions, BSK, FRC, and TOG, clade with the Flexuosus Group, along with the single Momordica accession PI414723 and the undecided accession PI149169. Thus, the Flexuosus and Conomon accessions are metabolically more similar to each other than would be expected based on their genetic relationships. Fruits of both Groups are used primarily when young, either fresh or pickled, similar to cucumbers. The mature fruit of both Groups are characterized by an

acidic pH and low sugar content, in contrast to the dessert melons with low acidity and high sugar content in the edible mature fruit [38]. In light of the fact that these two traits are governed by only a few genes [38–40], we expected that the large-scale metabolomics studies would overcome the bias of these few targeted metabolites and would more closely reflect the unbiased genetic relationships combining the three *agrestis* Groups. Our results with respect to *agrestis* suggest otherwise.

#### *3.2. Cantalupensis and Reticulatus Accessions are Separated from Each Other Both Genotypically and Metabolomically*

With regard to the Cantalupensis and Reticulatus accessions, the combined metabolomics-based classification does strongly parallel the phylogenetic classification. The separation of Reticulatus/Cantalupensis into two individual clades, IIb and IIc (Figure 2), that we observed in the genetic cladogram is similarly evident in the metabolomics classification (Figure 3). The Reticulatus Galia types (MAK, KRY) and the Cantalupensis Ha'Ogen types (OGE, NY, PH406) have a distinct metabolite profile from the Reticulatus/Cantalupensis Groups of American cantaloupes and Charantais. Furthermore, within each of the two Reticulatus/Cantalupensis Groups, the Cantalupensis and Reticulatus accessions are separated from each other, both genotypically and metabolomically. This is evident in the separation between the Ha'Ogen and Galia types, as well as the American cantaloupe and Charentais types from each other. The Ananas-type Ameri accessions, EDO and AYO, group with the Galia and Ha'Ogen accessions, as they did in the phylogenetic classification. Thus, the results relating to the climacteric, netted, and semi-netted *melo* subspecies (Reticulatus/Cantalupensis/Ameri) do support the similarity between genetic and metabolomics-based classification.

The Cantalupensis and Reticulatus Groups are horticulturally distinguished largely according to the extent of rind netting or reticulation, which is actually a continuous trait rather than two distinct phenotypes of netted versus smooth. The trait of rind reticulation is presumed to be relatively simply inherited [39,41,42], but there remains the possibility that the netting trait, present in the different clades of Reticulatus/Cantalupensis, may be under separate genetic control. The most visual distinction between the different clades of Reticulatus/Cantalupensis is the flesh color. While the Charentais-type Cantalupensis are orange-fleshed, the Ha'Ogen-type Cantalupensis are green-fleshed. Similarly, the American cantaloupe Reticulatus varieties are orange-fleshed while the Galia Reticulatus are green-fleshed. Flesh color is also determined by relatively few genes [40,43–45], but the associated pleiotropic metabolomic effects due to chloroplast to chromoplast transformations are expected to be large.

#### *3.3. The Metabolomic-Based Classification May Indicate That Two Independent Evolutionary Events Led to Non-Climacteric Ripening*

The results of the GC-MS analysis of melon volatiles (Figure S3) indeed showed that volatile metabolites alone distinguish between the non-climacteric and climacteric groups. However, our comprehensive genomic and metabolomic results also indicate that the Inodorus Group is genetically distant from the other cultivar-groups and this large genetic distance cannot be attributed only to the limited climacteric-related genetic differences but also primarily to other unrelated genetic evolution.

Concerning the Inodorus Group, most accessions are clearly sorted on the basis of their combined metabolites, similar to their phylogenetic classification. The inclusion of the Kirkagac Inodorus accessions clearly within the Inodorus metabolite group even though they were genetically distinct from it further indicates that the relationship between genetic and metabolomics-based classification is not simple, even when the latter is based on a large number of metabolic signatures as in our study. The genetic distinction between the characteristic Inodorus Group melons and non-characteristic Inodorus types referred to as European types was also noted by Leida et al. [20].

These results regarding the genetic relationships between most of the Inodorus accessions of clade IIa (Figure 2) and those of the Kirkagac accessions in clade IIb are significant as they indicate that there may have been two independent evolutionary events leading to non-climacteric ripening, one represented in the large Inodorus Group and the other more closely related to the clade IIb climacteric types. The genetic control of climacteric ripening in *C. melo* has been studied and found to be due to mutations in just a few genes [46,47]. However, these genetic studies were performed using typical Inodorus accessions as a genetic source for the non-climacteric trait and typical climacteric Cantalupensis (Charentais) accessions. The possibility remains that the genetic cause of non-climactericism in the IIa and IIb clades may not be identical and may even be due to mutations in different genes. In support of this possibility, Eduardo et al. [48] and Vegas et al. [49] reported that a supposed non-climacteric Conomon accession, crossed with the non-climacteric Inodorus PDS, yielded climacteric phenotypes, indicating independent genetic mutations. In fact, Saladie et al. [50] pointed out that classification of melon fruit ripening behavior into just two distinct types is an over-simplification, and that, in reality, there is a continuous spectrum of fruit ripening behavior. If so, further study of the relationship between the non-climacteric traits of Kirkagac and other Inodorus genotypes could broaden the genetic variability available for this important characteristic for genetically improving fruit quality by extending the harvest period as well as the shelf life of the horticultural product.

While practically all the Inodorus accessions metabolomically clade together, the CRE variety is a striking exception (Figure 3). CRE is most similar with respect to its metabolite signature to the single Khandalak accession INB, and together are most closely related to the Charentais Cantalupensis accessions. In fact, CRE is strikingly distinct phenotypically from the other Inodorus accessions, as can be seen in Figure 1. Whereas all the Inodorus lines studied are green-fleshed, as are non-climacteric Inodorus lines in general, the CRE variety is orange-fleshed and likely also climacteric. In fact, the GC-MS volatile results of flesh (Figure S4e) also distinguish between CRE and the other Inodorus lines and place CRE together with the BEL Ha'Ogen-type Cantalupensis. The CRE accession is a Crenshaw market type, which is considered to be a hybrid derivative of the non-climacteric, green-fleshed, smooth-skinned Casaba market-type melon and the climacteric cream/orange-fleshed Persian Cantalupensis melons and thus this may explain that metabolomically, it is similar to the latter.

The different analytical strategies and over all the different types of compounds targeted (primary or specialized metabolites, mineral elements) contributed differently to the classification of melon accessions. This may be related to the fact that some metabolite classes were under breeding selection and may represent convergence of metabolic traits rather than genetic distance. The latter seems probable for major primary metabolites, measured using 1H-NMR or GC-MS of polar extracts in flesh and implicated in sweetness or acidity. Convergence may also be the case for different families of specialized metabolites measured using LC-MS of semi-polar extracts, such as phenolics or alkaloids with an astringent or bitter taste, or GC-MS of volatiles, such as esters and alcohols. Even mineral elements in flesh contributed to accession classification, possibly through a link with flesh acidity, for instance, for K, or metabolic links between metabolites and mineral elements as shown previously [27]. Selection may have been less stringent for rind than for flesh composition, as for five out of the six analytical strategies used for both flesh and rind analyses, the correlations between compositional distances and genetic distances were higher for rind than for flesh (Table 3).

The correlation between genetic and metabolomics distance increased with the addition of the >80,000 combined molecular features and best mirrored the genetic classification. Nevertheless, information reduction by RF analysis allowed pointing to 20 informative molecular features that differentiate several accession groups. In flesh, 10 volatiles and xylose are possibly linked with climactericity as discussed above. Of these top 10 volatiles, 6 could be reliably annotated and are well known components of melon fruit, including (Z,E)-3-hexen-ol and 1-octen-3-ol, which result from fatty acid degradation, and also two typical acetate esters (2-methyl butyl acetate and pentyl acetate) as well as alpha-ionene [27,51]. Xylose is a major non-cellulosic neutral sugar of cell walls in cucurbit fruit [52]. Fruit softening during maturation involves cell wall degradation. For cell wall-related genes, specific genes of each gene family can be categorized as totally ethylene dependent, totally ethylene independent, or partially ethylene dependent [3]. Surprisingly, two mineral elements were in the top 20 molecular features highlighted by the RF analysis. Among the six compounds highlighted in

peel, glycerol may be related with the fruit surface, such as cuticular waxes [53] or suberization [54]. Galactinol and raffinose linked classification to the raffinose family and to oligosaccharides metabolism in the peel (Table S4). Galactinol and raffinose linked with the raffinose family of oligosaccharide metabolism after phloem unloading into the melon fruit may be related to fruit sweetness [55].

#### **4. Conclusions**

The main objective of this research was to determine whether classification by large-scale, non-targeted metabolomic and element profiling technologies recapitulates or extends the phylogenomic relationships within *Cucumis melo*. Our results indicate that, in general, metabolomic/elemental means of classification can indeed significantly reflect the genetic relationships. Nevertheless, there are deviations of metabolomics and elemental groupings from genetic classifications. These differences indicate that the selection for major phenotypic quality traits by the breeding process has been influential. Some of the quality traits that can be controlled by single or few enzymatic or regulatory genes, include changes of metabolite levels that define color, taste, and flavor. In addition, some traits change fruit development and thereby have pleiotropic effects on fruit metabolomics.

Large-scale information on genomic sequence and metabolomic/elemental variation of a broad range of genetic diversity can serve for dissecting the genetic basis of metabolic diversity. This process has been recently referred to as mGWAS (metabolite genome wide association study) [56]. mGWAS presents a powerful tool for attributing metabolite variation to particular genetic regions. In melon, GWAS based on GBS was even capable of mapping traits to the single candidate gene level [37] and it is expected that the comprehensive metabolomics data presented here will allow for a large-scale mapping of metabolic traits.

Furthermore, this work sets the stage for an unlimited number of metabolic QTL (mQTL) analyses based on recombinant populations generated from selected metabolically characterized parental lines. The identification of valuable genetic and metabolic variability forms the basis for directed crop diversification and genetic improvement by breeding. The future combination of the results of this study with gene expression data of the developing melon fruit rind and flesh provides an 'omics' blend of genomics, metabolomics, and transcriptomics that will be especially useful for the identification of the genetic basis of metabolic diversity [44].

#### **5. Materials and Methods**

#### *5.1. Plant Material Description, Cultivation, and Sampling*

An initial core collection comprised of 51 *Cucumis melo* accessions representing a broad spectrum of melon genetic variation was grown as a spring-summer crop in an open field at the Newe Ya'ar Research Center in the Yizre'el Valley, northern Israel. The accessions prospectively represented both subspecies and 11 of the cultivar-groups (Table 1; Figure 1; Table S1). These 51 accessions were a representative subset of a larger collection of 177 accessions that were used for detailed characterization of fruit trait variation and evaluation of the potential of genome-wide association (GWA) for trait mapping in melon [37].

Of the 51 accessions used for metabolomics analysis, 7 were not included in the GBS study but were genotypically classified in a less encompassing sequencing project based on direct sequencing of 20 PCR products of known genes. This direct sequencing comprised a total of nearly 12,000 bp and provided 116 polymorphisms for which haplotypes were produced and used for the phylogenetic analysis (Table S2). This phylogenetic analysis is not as comprehensive as the GBS but was useful in placing those accessions for which GBS data was missing.

Among these 51 accessions were included two closely related Duda'im melons (DUD2, DUD3) for a total of 52 sampled accessions. Each accession was represented in three replicated plots of five plants each. A minimum of 10 ripe fruits were harvested from each accession. Fruit were photographed prior to sampling. For sample preparation, a midsection of each fruit was excised, and its seed cavity removed. For the rind sample, the rind was removed using a vegetable peeler to a depth of approximately 2–3 mm and the rinds from all fruit of each accession were combined into a single sample. Similarly, the flesh samples of each accession were combined. Combined bulk samples were ground to a powder in liquid nitrogen. Falcon tubes were filled with approximately 50 mL of powdered sample and a total of 104 tubes (flesh and rind samples of 52 accessions) were stored at −80 ◦C until shipment under dry ice to each of the participating laboratories where two replicate samples were analyzed per accession/per platform.

#### *5.2. Genotype and Phylogenetic Analysis*

#### 5.2.1. DNA Isolation for Genotype by Sequencing (GBS)

For the GBS of the melon collection, leaf tissue was taken from 44 of the 51 accessions, listed in Table 1. Total genomic DNA isolation was performed as described by Gur et al. [37] using the GenEluteTM Plant Genomic DNA miniprep kit (Sigma, St. Louis, MO, USA). The quality of the DNA was analyzed with an ND-1000 spectrophotometer (Nanodrop Technologies, Wilmington, DE) and by electrophoresis on agarose gel. The concentration of DNA was estimated using a Qubit® 2.0 Fluorometer (Life Technologies, City, Singapore) and a Qubit® dsDNA BR Assay Kit (Life Technologies, Eugene, OR, USA).

#### 5.2.2. GBS Analysis

DNA was analyzed at the Institute for Genomic Diversity facility at Cornell University for GBS. GBS 96-plex libraries were prepared using the restriction enzyme ApeKI, following an established protocol [57]. Fragments were sequenced on an Illumina HiSeq 2500 as 100 bp, single-end reads and aligned to the reference genome of *C. melo* [58] available at https://melonomics.net/files/Genome/Melon\_ genome\_v3.5.1/. TASSEL pipeline v3.0.173 was used for sequence alignment and single nucleotide polymorphism (SNP) calling [59]. Further filtration was performed using TASSEL v5.2.33 [60]; the SNP list was filtered to MAF > 0.05 and a maximum of 6% missing data per site.

#### 5.2.3. Phylogenetic Analysis

TASSEL software (v5.2.33) was used to estimate the distance matrix for all pairwise combinations. The phylogenetic tree based on the GBS results of the 44 accessions was assembled using the neighbor-joining function. In addition, we constructed a more limited phylogenetic tree based on 116 polymorphic sites derived from direct sequencing of 11,950 bp derived from 23 genomic sequences (Table S2). The sites were concatenated to create a pseudo-sequence haplotype, which was then aligned (Figure S1a). The aligned sequences were used to build a neighbor-joining tree, and a Phylip distance matrix (Figure S1b). The final alignment, tree, and distance matrix were performed using Clustal Omega version 3.0, www.ebi.ac.uk/Tools/msa/clustalo.

#### *5.3. Global Measurements of Fruit Quality*

Fruit total soluble solids (TSS, as degrees Brix) and pH were measured on a set of the frozen powdered samples. Portions were allowed to defrost, and juice was measured using a hand-held refractometer (Atago A-10) and a pH meter for each of the 52 accessions used for metabolomic and elemental measurements. Percent of dry matter (%DM) was measured by drying a representative sample in a 60 ◦C oven and weighing before and after. Means per accession were calculated.

#### *5.4. Metabolomics and Elementals Analysis*

#### 5.4.1. NMR-Based Metabolomic Analyses

For targeted 1HNMR, polar metabolites were extracted from 50 mg of lyophilized powder using a hot ethanol/water series, and then analyzed, identified, and quantified by 1H-NMR profiling as previously described [27,61]. For the preparation of extracts and NMR acquisition parameters, special care was taken to allow absolute quantification of individual metabolites. Quantitative 1H-NMR spectra were recorded at 500.162 MHz and 300 K on a Bruker Avance spectrometer (Wissembourg, France) with a 5-mm inverse probe using a 90◦ pulse angle and an electronic reference for quantification with calibration. Two replicates were extracted and analyzed for each accession. Unknown metabolites were named using the mid-value of the chemical shift and the multiplicity of the corresponding resonance group and quantified in arbitrary units. The same spectra issued from polar extracts were also processed as fingerprints: The spectra from δ9.40–0.40 ppm (without the residual water resonance) were binned to chemical shift regions of 0.04 ppm and data were scaled to the total signal intensity.

For untargeted NMR fingerprinting of semi-polar extracts, metabolite extraction was carried out according to the methods described by Ward et al. [62]. Briefly, triplicate aliquots (15 mg) of freeze-dried powder were extracted with an 80:20 D2O:CD3OD mixture (1 mL) containing *d4*-TSP as the internal standard (0.01% *w*/*v*). Samples were extracted for 10 min at 50 ◦C. Supernatants were transferred to a clean tube and heated to 90◦C for 2 min. Samples were cooled and centrifuged. For 1H-NMR, 200 μL of the supernatant was evaporated to dryness and reconstituted in 650 μL of deuterated sodium phosphate buffer solution (pH 6.0, 200 mM). Samples were mixed and allowed to stand at room temperature for 1 h after which 600 μL were then transferred to a 5-mm NMR tube for 1H-NMR analysis. 1H-NMR spectra were collected at 600 MHz on a Bruker Avance spectrometer equipped with a 5-mm selective inverse probe. Parameters for data acquisition are as described in Ward et al. [62]. 1H-NMR data (δ9.395–0.505 ppm) were binned to chemical shift regions of 0.01 ppm and data were scaled to the total signal intensity. Regions corresponding to residual water and methanol were removed (H2O 4.775–4.865; MeOH 3.285–3.335 ppm set to zero).

#### 5.4.2. GC-MS-Based Metabolomic Analysis of Polar Compounds

GC-MS profiling analysis of a polar metabolite fraction enriched for primary metabolites, including sugars, amino acids, and organic acids, was performed as described earlier [63] with modifications reported by Moing and co-authors [27]. Samples were extracted, partitioned into a polar liquid phase, and dried for chemical derivatization, as described. Methoxymation was performed with 40 μL of pyridine containing 1 mg/mL methoxyaminohydrochloride and 80 μL silylation-mixture, containing 7:1 (*v*/*v*) N,O-bis(trimethylsily)-trifluoroacetamide (BSTFA, Macherey-Nagel, Düren, Germany) and a mixture of alkanes in pyridine. GC-MS was by splitless mode after injection of 1 μL of chemically derivatized sample. Evaluated mass features had intensities of ≥ 50 arbitrary intensity units. Peak heights of mass features defined by nominal mass to charge ratios (m⁄z) and *n*-alkane based retention indices were normalized to the sample fresh weight and the internal standard, succinic-*d*<sup>4</sup> acid (Sigma-Aldrich, Deisenhofen, Germany). The internal standard was added to the extraction solution. Normalized mass spectral features were aligned, correlated across all recorded samples, and placed into clusters and time groups using TagFinder [64]. Metabolite annotations of mass spectral features were manually supervised using TagFinder visualizations for mass spectral matching [64]. Metabolite annotation required a match of at least three co-eluting and correlated mass features and a retention index deviation < 5% [65] compared to the reference data of the Golm Metabolome Database, http://gmd.mpimp-golm.mpg.de/ [66].

#### 5.4.3. GC-MS-Based Metabolomic Analysis of Volatile Compounds

For GC-MS of volatile organic compounds, headspace-solid phase micro extraction (SPME) with a 65-mm polydimethylsiloxane-divinylbenzene fiber (Supelco, Bellefonte, USA) was used as described previously [67]. In short, 200 mg of frozen powder was mixed with 4 mL of 4.6 M CaCl2 containing 5 mM EDTA in a 10-mL vial. Volatiles in the sample headspace were trapped for 20 min at 50 ◦C with agitation (CombiPAL autosampler, CTC Analytics, Switzerland) and thermally desorbed in the GC injection port for 1 min at 250 ◦C. A HP-5 column, 30 m x 0.25 mm ID, 1.05 μm – film thickness (Hewlett Packard, Palo Alto, CA, USA) was used to separate the volatile compounds, applying a temperature gradient from 44 to 250◦C at a speed of 5 ◦C min<sup>−</sup>1. All masses from m/z 35 to m/z 400 resulting from 70 eV electron impact ionization were recorded (MD800 electron impact MS, Fisons Instruments, Milan, Italy). Metalign software [68] was used to extract and align all mass features in an untargeted manner and masses originating from the same molecule were then clustered to reconstruct the relative intensity and mass spectrum of each detected compound, using MSClust software [69]. Volatiles were putatively annotated by matching the reconstructed mass spectra of detected compounds to the electron impact mass spectral libraries of NIST (National Institute of Standards and Technology, Gaithersburg, MD, USA) and Wiley. Subsequently, for the random forest (RF) machine learning analysis on the putatively annotated data (Supplementary Table S4), the top 10 volatiles arising as potential markers were then further manually checked for matching of the retention index and mass spectra. Annotations with an RI >20 and/or a match factor <750 were reclassified as 'unknowns'.

#### 5.4.4. FIE- or LC-MS-Based Metabolomics Analysis

For FIE-MS, 50 μL of the semi-polar extract supernatant prepared for NMR fingerprinting was diluted with 80:20 H2O:CH3OH (950 μL). Samples were analyzed using an Esquire 3000 spectrometer (Bruker Daltonics, Coventry, UK) in both positive and negative ionization modes as described previously [24]. Spectral data were exported as ASCII files containing mass-intensity pairs and automatically reduced using AMIX software version 3.9.11 (Bruker Biospin, Coventry, UK), to a single CSV file for each ionization mode, containing integrated regions of equal width (*m*/*z* width = 1). Individual signal intensities were scaled to the total intensity and m/z regions relating to *d4*-TSP and its isotope peaks were removed from the data prior to statistical analysis.

For LC-QTOF-MS, ground frozen rind and flesh tissue (0.5 g) were extracted in 1.5 mL of methanol, containing 0.1% formic acid. The samples were vortexed vigorously, sonicated for 20 min, and centrifuged for 15 min at 4400 *g*. For rind samples, the supernatant was filtered through a 0.22-μm PVDF filter directly to HPLC vials. Flesh samples were further concentrated as follows: 1 mL of supernatant was freeze-dried and resuspended in 150 μL of 75% aqueous methanol, containing 0.1% formic acid, sonicated for 20 min, and filtered through a 0.22-μm PVDF filter directly to the insert of the HPLC vial. Melon samples were injected into a UPLC-QTOF-MS (HDMS-Synapt, Waters, Manchester, UK) in negative ionization mode as in [70] with some modifications: Short 9.5-min gradient was used for metabolite separation. The linear gradient program was as follows: 100% to 90% phase A over 1 min, 90% to 75% phase A over 3 min, 75% to 55% phase A over 2 min, and 55% to 0% phase A over 0.5 min; held at 100% phase B for 1 min; and then returned to the initial conditions (100% phase A) in 1 min and conditioning at 100% phase A for 1.0 min. A divert valve (Rheodine) excluded the first 1.3 min and last 1.8 min of injection. A mixture of 15 standard compounds, injected after each 10 samples, was used for quality control. XCMS software [71] was used for peak picking and peak alignment. Intensity values were loge-transformed.

#### 5.4.5. Elemental Analysis

For microelements, frozen milled melon samples were freeze-dried and 200–250 mg dry material was wet digested (5 mL 65 % HNO3 and 5 mL 15 % H2O2) at 210 ◦C in 100-mL closed vessels using a microwave oven (Multiwave 3000 software version 1.24, Anton Paar, Graz, Austria). Before analysis, the digests were diluted to a final concentration of 3.5% HNO3. Multi-element analysis was performed using an ICP-MS (Agilent 7500ce, Agilent Technologies, Wokingham, UK) as described in Bernillon et al. [24]. The impact of spectral interferences was reduced using an octopole ion guide, pressurized with He or H2 certified reference material (NIST 1515, apple leaves, particle size <75 μm; National Institute of Standards and Technology, Gaithersburg, MD, USA) were included. Only elements deviating less than ±10 % from the certified reference values are reported.

#### 5.4.6. Data Handling and Mining

Metabolome and elemental data, i.e., the means of at least two technical replicates per accession sample and of each fruit tissue, were gathered from the consortium of laboratories. Initial response/abundance values below zero were judged to be missing data and removed. Data of each molecular feature were log10-transformed response ratios relative to the median across all samples. Subsets of each analytical technology, including both concatenated flesh and rind profiles, were 0.1 - 0.9 quantile range-normalized to harmonize the diverse linear response ranges of each profiling technology. Missing data were replaced by zero to give enhanced weight to subspecies- or group-specific molecular features rather than inferring values by more elaborate means, e.g., [72,73]. The resulting data set of > 80,000 molecular features was analyzed either in combined mode or divided into subsets.

Independent component analysis (ICA) of 52 samples included 51 accessions (Table 1) and a second independent sample of the Duda´im accession. ICA was applied to the first three components of a preceding principal component analysis and was performed using the MetaGeneAlyse web-application, https://metagenealyse.mpimp-golm.mpg.de/ [28]. ICA scores values were plotted. The Multi Experiment Viewer software v 4.9 [74] generated the covariance matrix of the combined data set and a support hierarchical cluster analysis (support HCA) using covariance distance, complete linkage, and bootstrapping by 1000 iterations. Bootstrap values range from 1–100 and represent the frequency of tree-node occurrence across the iterations.

In order to compare accession clustering based on genetics with the ones based on compositional distances for each tissue and each analytical strategy separately, Euclidian distances were calculated using the corresponding metabolome or ionone sub-dataset with zeros for undetected metabolomic features for the 52 accession samples with Multi Experiment viewer v 4.9. Then, accession sample clustering based on compositional data was performed under R (https://www.R-project.org/) using the Euclidian distance matrices calculated with Multi Experiment viewer and with the "complete" method for sample hierarchical clustering. In order to compare distances between accessions based on genetics with the ones based on composition, each compositional Euclidian distance matrix was restricted to the 44 accessions having both molecular and fruit metabolome or elemental data. Then, a Mantel test was done to measure the association between the molecular distance matrix and each metabolomic or elemental distance matrix by calculating the Pearson correlation between these two matrices, with its *p*-value calculated with 10000 Monte Carlo simulations using XLSTAT (v2019, Addinsoft, Paris, France). When the *p*-value was below 0.001, the two matrices were considered as significantly correlated.

To select annotated metabolic features that are relevant to classifying the *C. melo* accessions, random forest (RF) analysis was performed using the subset of annotated molecular features [27] from the combined data set (Table S4). Feature selection by RF technology was as described earlier [29] using hyperparameter optimization proposed by [75]. The R-package randomForest v 4.6–14 [76,77] was used for training classification trees. The hyperparameter tuning was performed using the tuneranger function from the tuneRanger package V 0.2 [75]. The optimized parameters were mtry, node size, and sample size. Other non-tuned RF parameter settings were ntree = 10,000, importance = TRUE, replace = TRUE. The decision tree was created by the functions rpart and rpart.plot from the rpart V 4.1–13 and rpart.plot V 2.2.0 packages, respectively, using the method "class". We selected the top 20 molecular features according to the mean decrease of accuracy from 10 repeated RF analyses. A decision tree was created, limiting the tree to only those branches that contain all members of a class. The averaged mean decrease of accuracy and mean decrease of the Gini index were correlated by Pearson's correlation coefficient (r2) 0.931 assuming linear association.

All data were visualized by the respective R-packages and the Multi Experiment Viewer [74] in combination with Microsoft Excel and PowerPoint.

**Supplementary Materials:** The following are available online at http://www.mdpi.com/2218-1989/10/3/121/s1, Figure S1: Phylogenetic tree and alignments based on direct PCR sequencing of melon accessions not included in GBS, Figure S2: Fruit quality global measurements of 52 accession samples, Figure S3: Hierarchical clustering analysis of 52 accession samples performed per analytical platform for fruit flesh, Figure S4: Hierarchical clustering

analysis of 52 accession samples performed per analytical platform for fruit rind, Table S1: Description of the melon accessions used in this study, Table S2: Genes and primers used for PCR-based genome sequencing and polymorphism identification, Table S3: Genetic distance matrix of the 44 GBS-sequenced accessions, Table S4: List of 605 provisionally-annotated molecular features from the combined data set with assignment of variable-importance obtained by RF technology.

**Author Contributions:** Conceptualization, A.A.S. and R.D.H.; sample preparation and biochemical analyses, A.A., J.W.A., B.B., J.B., M.H.B., C.D., A.E., R.G., T.H.H., D.R., M.M., S.M. (Sagit Meir), S.M. (Sonia Miller), R.M., I.R., J.K.S., R.C.H.d.V., J.L.W., E.Y.; genetic analyses, A.G., N.K., E.O., H.S.P., E.L., U.S., Y.T., Y.B., A.F., S.B.-D., G.T. data curation and mining, F.B., A.E., D.J., J.K., A.M.; writing and original draft preparation, A.E., R.D.H., J.K., A.M., H.S.P., A.A.S.; manuscript reviewing and editing, all authors; funding acquisition, R.D.H. All authors have read and agreed to the submitted version of the manuscript.

**Funding:** The data presented in this paper were generated within the META-PHOR project financed by the European Union Framework 6 programme [2006-FOODCT-036220].

**Acknowledgments:** A.E. and J.K. acknowledge technical and laboratory support of the GC-MS analyses by Ines Fehrle.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Color Mutations Alter the Biochemical Composition in the San Marzano Tomato Fruit**

**Gabriella Dono 1, Jose Luis Rambla 2,3, Sarah Frusciante 4, Antonio Granell 2, Gianfranco Diretto <sup>4</sup> and Andrea Mazzucato 1,\***


Received: 14 February 2020; Accepted: 12 March 2020; Published: 15 March 2020

**Abstract:** San Marzano (SM) is a traditional Italian landrace characterized by red elongated fruits, originating in the province of Naples (Italy) and cultivated worldwide. Three mutations, *yellow flesh* (*r*), *green flesh* (*gf*) and *colorless fruit epidermis* (*y*) were introduced into SM by backcross and the resulting introgression lines (ILs) produced the expected yellow, brown and pink fruit variants. In addition, ILs carrying double combinations of those mutations were obtained. The six ILs plus the SM reference were analyzed for volatile (VOC), non-polar (NP) and polar (P) metabolites. Sixty-eight VOCs were identified, and several differences evidenced in the ILs; overall *gf* showed epistasis over *r* and *y* and *r* over *y*. Analysis of the NP component identified 54 metabolites; variation in early carotenoids (up to lycopene) and chlorophylls characterized respectively the ILs containing *r* and *gf*. In addition, compounds belonging to the quinone and xanthophyll classes were present in genotypes carrying the *r* mutation at levels higher than SM. Finally, the analysis of 129 P metabolites evidenced different levels of vitamins, amino acids, lipids and phenylpropanoids in the ILs. A correlation network approach was used to investigate metabolite–metabolite relationships in the mutant lines. Altogether these differences potentially modified the hedonistic and nutritional value of the berry. In summary, single and combined mutations in *gf*, *r* and *y* generated interesting visual and compositional diversity in the SM landrace, while maintaining its original typology.

**Keywords:** fruit pigmentation; introgression lines; metabolomics; mass spectrometry; San Marzano landrace; tomato

#### **1. Introduction**

In tomato (*Solanum lycopersicum* L.), fruit color is one of the most important traits affecting consumer liking, and it is the result of combined effect of carotenoids, flavonoids and eventually chlorophylls. The red color of ripe fruit comes from the accumulation of all-trans-lycopene; mutants affected in the carotenoid pathway have an altered carotenoid composition, resulting in different fruit colors [1,2]. Besides carotenoids, flavonoids play a role in determining the color of tomato fruit, particularly at the epidermal level [3]. One of the most abundant flavonoids in tomato fruit peel is the yellow-colored naringenin chalcone, often preceding and paralleling the production of lycopene in the pericarp [4]. Lastly, chlorophylls can eventually have a role in defining the color of the fruit; although they are normally degraded during ripening, "stay green" (SGR) mutants exist, that maintain

important chlorophyll levels in the ripe fruit. Color variations do not only change fruit pigmentation, but they also affect the entire set of tomato fruit attributes, adversely or positively impacting the organoleptic and nutritional properties. Indeed, ripening involves a number of physiological processes ranging from the chlorophyll breakdown to the consecutive carotenoid accumulation, but also of other compounds belonging to the secondary metabolism, which play functional roles in plants [5], and have a potential in preventing diseases and promoting health in animals and humans [6,7]. Among them, acids, tocochromanols, quinones, fatty acids, sugars and polyols, as well as glycoalkaloids vary during ripening, thus modifying the nutritional value and the antioxidant activity of the fruit [8]. Three tomato mutants, *yellow flesh* (*r*), *colorless fruit epidermis* (*y*) and *green flesh* (*gf*), which are representative of the main classes of color pigments, were chosen to be studied at metabolite level, with the aim of assessing the impact of these mutations on the fruit metabolome with emphasis on compounds that co-participate in defining fruit qualities.

Yellow-fruited tomatoes have been documented since the first introduction in Europe [4,9]. The yellow color is due to the *r* mutation, represented by loss-of-function alleles of phytoene synthase 1 (*Psy1*; *Psy1* catalyzes the first rate-limiting step in the carotenoid pathway, the condensation of two molecules of geranylgeranyl diphosphate in phytoene [10]), which results in the inhibition of the whole carotenoid biosynthesis [11]. Recently, *r* tomatoes have been meeting with an increasing success, for the color novelty and the peculiar organoleptic qualities [12].

The *y* fruit mutant was originally described as a monogenic recessive variant leading to the formation of a colorless fruit peel [13]. The mutation, mapped on the short arm of Chr1 [14], involves the *SlMYB12* transcription factor, causing the lack of naringenin chalcone, one of the major flavonoids in tomato fruit peel, which gives the yellow color and has been proposed to influence the characteristics and function of the cuticular layer [15,16]. The pink *y*-type fruit mutation has been identified in numerous cultivated varieties that are highly consumed in Asian countries.

Fruits of *gf* tomato mutant were described for their characteristic muddy brown color, resulting from the accumulation of lycopene coupled with the heterochronic presence of chlorophyll in the ripen fruit due to a lack of chlorophyll degradation [17]. In *S. lycopersicum*, the *Gf* locus maps on the long arm of Chr8. Further studies indicated that *gf* is a member of the SGR gene family, *SlSGR1*, a Mg-decheletase gene needed for chlorophyll catabolism [18,19]. Nowadays, many cultivated tomatoes, heirloom varieties but also modern hybrids, exhibiting the *gf* phenotype are commercially available; indeed, these varieties are appreciated and have been reported to be superior for taste-related compounds [20,21].

San Marzano (from now on referred to as SM) is a traditional variety grown in the region of Naples, Italy; it is considered an important model for fruit quality parameters, because of its intense and uniform red color, which revealed peculiar sensory profiles. Indeed SM fruits are covered by the leaves, which promotes the lycopene accumulation, but at the same time reduces the sugar content, giving birth to its typical bittersweet flavor; this is also mainly due to the ratio between the sugar content, where fructose and galactose are the main carbohydrates, and organic acids, mostly citric acid, a major factor in determining sourness, but also flavor intensity [22–24]. The effect of San Marzano extracts has also been the object of studies aimed at assessing nutraceutical properties [25].

The *r*, *y* and *gf* mutations, together with other variants affecting the tomato fruit phenotype, have been introgressed by backcrossing into the common genetic background SM; the single mutant lines have also been used to express two variants in double combinations [26].

In this work, we analyzed the effect of introducing color mutations in the metabolite complement of the SM fruit. Namely we compared a comprehensive set of fruit volatile compounds (VOCs), including those involved in flavor, and both non-polar (NP) and polar (P) specialized metabolites, including those involved in health properties, in the *r*, *y* and *gf* single mutant lines and their respective double mutants, to the ripe fruit of the SM genetic background. The double mutants expanded the effect on metabolism of the introduced single mutations by revealing additional additive or epistatic effects that could be further exploited for the improvement and diversification of the SM landrace by introducing innovation, while maintaining the characteristic SM typology.

#### **2. Materials and Methods**

#### *2.1. Plant Material and Growth Conditions*

Three single mutant lines harboring *r*, *y* and *gf* and the double mutants representing all the possible combinations in the SM background were studied in comparison with the corresponding wild-type (Table 1). Details on the backcross scheme used to obtain these introgression lines, as well as the growth conditions used were reported before [26]. Briefly, eight plants per accession were transplanted and cultivated in twin rows in an unheated tunnel following standard cultural practices for indeterminate tomatoes, using tutors and weekly removal of lateral shoots. Daily temperature was controlled by a ventilation system and plants were irrigated through a drop system. The trial was repeated with identical materials and methods for two consecutive years (2017 and 2018).

**Table 1.** Extended names, genetic symbols and description of the berry color of the seven studied genotypes, including San Marzano (SM), three single mutant and their respective double mutants.


#### *2.2. Fruit Sampling*

Before sampling, berries were visually inspected and only intact and healthy tomatoes were collected. Two biological replicates for genotype, each represented by four fully ripe berries, were harvested over a period of three days, during the first week of August, for each year. A longitudinal pericarp wedge was excised from each of the four appropriately washed berries and cut into cubes; each replica, consisting of about 30 g of fresh material, was immediately frozen in liquid nitrogen and homogenized until a fine powder was obtained. Aliquots of about 10 g of this material were freeze-dried for the analysis of non-volatile secondary metabolites. All samples, both frozen and freeze-dried, were stored at −80 ◦C until analysis.

#### *2.3. Volatile Detection and Quantification*

For volatile analysis, two biological replicates and two technical replicates were processed and analyzed independently for the two year experiments. Prior to the analysis, frozen fruit powder (0.5 g fresh weight) from each sample was weighed in a 15 mL vial, closed and incubated at 37 ◦C for 10 min. Then, 1.1 g of CaCl2·2H2O and 500 μL of EDTA 100 mM (pH 7.5) were added; samples were then gently shaken and sonicated for 5 min. Then,1 mL of the homogenized mixture was transferred into a 10 mL screw cap headspace vial, where volatiles were collected by head space solid-phase microextraction as previously described [27]. A 65 μm PDMS/DVB SPME fiber (Supelco Analytical, Bellefonte, PA, USA) was used for all the analysis. Pre-incubation and extraction were performed at 50 ◦C for 10 and 20 min respectively. Desorption was performed for 1 min at 250 ◦C in splitless mode. Volatile extraction and injection were performed by means of a CombiPAL autosampler (CTC Analytics AG, Zwingen, Switzerland)). Separation and detection were performed by a 6890N gas chromatograph coupled to a 5975B mass spectrometer (Agilent Technologies Inc., Waldbronn, Germany) with DB-5ms fused silica capillary column (60 m, 0.25 mm, 1 μm) (J&W Scientific, Agilent Technologies Inc., Santa Clara, CA, USA). Oven temperature conditions were 40 ◦C for 2 min, 5 ◦C/min ramp until 250 ◦C and then held isothermally at 250 ◦C for 5 min. Helium was used as carrier gas

at 1.2 mL/min constant flow. Ionization was performed by electron impact (ionization energy, 70 eV; source temperature 230 ◦C). Data acquisition was performed in scan mode (mass range *m*/*z* 35–250; 6.2 scans per second). Chromatograms and spectra were recorded and processed using the Enhanced ChemStation software (Agilent). Untargeted analysis of all the compounds in the chromatogram was performed by means of the MetAlign 3 software [28]. Compounds were unequivocally identified by comparison of both mass spectra and retention time to those of pure standards (SIGMA-Aldrich, St. Louis, MO, USA). For quantification, peak areas of selected specific ions were integrated for each compound and normalized by comparison with the peak area of the same compound in a reference sample injected regularly, in order to correct for variations in detector sensitivity and fiber aging. This reference sample consisted of a homogeneous mixture of all the samples analyzed. Data for each sample were expressed as the relative content of each metabolite compared to those in the SM reference.

#### *2.4. Non-Volatile Detection and Quantification*

For both non-volatile P and NP metabolites, two biological replicates and two technical replicates, for two years, were processed and analyzed independently. Prior to analysis, 10 mg of freeze-dried fruit powder from each sample were weighed and extracted (i) with 0.75 mL cold 75% (*v*/*v*) methanol with for 0.5 mg/L formononetin as internal standard (IS) for P metabolites, as previously described [29]; and with (ii) 0.25 mL cold 100% (*v*/*v*), 1 mL of chloroform spiked with 25 mg/L α-tocopherol acetate as internal standard and 0.25 mL 50 mM Tris buffer (pH 7.5, containing 1 M NaCl) as described in [30]. Liquid chromatography coupled to high-resolution mass spectrometry (LC-HRMS) conditions were as previously reported for, respectively, polar [31] and non-polar [32] metabolomes.

Metabolite identification was performed by comparing chromatographic and spectral/MS properties with authentic standards, if available, and reference spectra, and based on the *m*/*z* accurate masses as found in the Pubchem database for monoisotopic masses, or in the Metabolomics Fiehn Lab Mass Spectrometry Adduct Calculator for adduct ions. Quantification of each metabolite was carried by calculating the relative contents to the formononetin (P) and α-tocopherol acetate (NP) IS levels.

#### *2.5. Statistical and Bioinformatics Analyses*

Raw data were firstly inspected and manually curated for the presence of outliers (e.g., when % st.dev./avg exceed 30%). For principal component analysis (PCA), the complete dataset after log2 transformation and including all replicates was considered. Untargeted analysis of VOC, and NP and P metabolomes was carried out as previously reported using, respectively, MetAlign and the SIEVE software (v2.2, ThermoFisher Scientific, Waltham, MA, USA; [33]).

As untargeted analysis revealed a consistent year effect, the "Gen\*Year" interaction was investigated by two-way multivariate analysis of variance (MANOVA) on those metabolites than presented less than 30% missing values. The analysis was performed with the PROC GLM procedure and the MANOVA statement implemented in the SAS software package (v9.4M6, SAS® University Edition, SAS Institute Inc., North Carolina State, USA). Since "Gen\*Year" interaction was found to be the less consistent source of variation, allowance was made for the existing interaction, data were mediated over the two years and all genotypes were presented with a single mean value. PCA was performed with SIMCA-P version 11 (Umetrics, Umea, Sweden) with Unit Variance normalization. The differences between each line and the SM reference were assessed using Student's *t*-test at the 5% significance level (*p* < 0.05). Graphs were elaborated with Excel (Microsoft Office 2013, Microsoft Corporation, Washington, DC, USA).

Venn diagrams were generated using Venny software (https://bioinfogp.cnb.csic.es/tools/venny/ index.html, v2.1). Correlation networks were generated using average values over the two years under study, as previously described [34,35]. To better evaluate most robust metabolite-metabolite associations (e.g., significant correlations), the MCODE Cytoscape plugin was used [36].

#### **3. Results**

#### *3.1. Untargeted Analysis of Volatile, Non-Polar and Polar Metabolites*

Untargeted metabolomics aims to gather information on global metabolic profiles by retrieving, in an unsupervised way, as many metabolites are detectable in a GC-MS/LC-HRMS chromatograms. The comparison of the entire VOC, NP and P metabolome between the SM control and the six mutated lines detected the total features of their metabolic profile differences, setting the stage for a more specific targeted metabolomics study later in this paper. By using this approach, 263 VOCs, 746 NP and 110 P compounds were identified in the samples, many of which were differentially accumulated in, at least, one pairwise comparison (Table S1). For VOCs, the first two principal components explained over 51% of total variance; PC1 separated *gf* and *y* from the other lines, while PC2 kept *r* plus all the combinations harboring the *y* mutation distinct from the others (Figure 1a). In Figure 1b, PCA of the first two components for the NP compounds explained almost 53% of the total variance; SM and two *green flesh* genotypes were clearly separated from *y* mutants plus *r* by the PC1, and *r* mutants were grouped in the lower quadrants by the PC2. For P metabolites, the first two PCA components explained the over 55% of total variance (Figure 1c); PC1 kept more clearly SM and the lines carrying *r* separated from *y* mutants, and *y\_r* was positioned exactly halfway between its parental lines. PC2 separated lines carrying *yellow flesh*, plus *y* from SM, which was in the upper side of the graph together with *gf* and *y\_gf*. Overall, the untargeted metabolomic analysis revealed that the mutations mostly affected NP rather than VOC and P metabolome, with the *r* mutant showing the highest extent of changes, including 112 NP compounds (106 down- and 6 up-regulated; Table S1). In addition, *r* also displayed the highest epistatic attitude towards *y* (in the VOC and NP fractions) and *gf* (P metabolome), while, notably, *y* was epistatic to *gf* in the NP untargeted metabolome.

**Figure 1.** PC1 × PC2 score plots of the six mutated lines plus San Marzano (SM) according to relative values of 263 VOCs (**a**), 746 NP (**b**) and 110 P (**c**) metabolites measured by GC-LS and LC-HRMS. Line symbols are explained in Table 1.

#### *3.2. Estimation of "Gen\*Year" Interaction in the Quantification of Targeted Metabolites*

Analysis of untargeted metabolites revealed the presence of a consistent Year effect (Figure S1). Seventy-eight VOCs, 33 NP and 69 P metabolites were independently subjected to multivariate ANOVA; the interaction was not significant for VOCs, but it was highly significant for NP and significant for P compounds (Table S2). "Gen\*Year" interaction was found to be the least consistent source of variation; therefore, data were mediated over the two years and all genotypes were presented with a single mean value in targeted analyses.

#### *3.3. Targeted Analysis of Volatile Compounds*

In order to give a more specific characterization of the flavor, volatile composition of each of the mutated lines in comparison with the wild-type SM was carried out. The selected analytical strategy allowed the relative quantification of 68 VOCs unequivocally identified by both mass spectra and retention index with those of authentic standards (Table S3). Overall, eight compounds were related to benzenoids (B), ten to branched-chain amino acid-relatives (BCAA), nine to apocarotenoids (C), two to esters (E), twenty-four to fatty acids derivatives (L), four to phenylalanine derivatives (Phe), two to sulfur compounds (S) and six to monoterpenoids (T).

PCA of the volatile composition revealed that the first two components explained about the 54% of the total variance; the score plot showed the position of the double mutants, related to their parental lines, and with respect to SM (Figure 2a). Indeed, *r* and *y\_r* were co-located in the same dial, in agreement with the VOC untargeted metabolome plot; conversely, *r\_gf* placed halfway between *r* and *gf* according to PC1 (Figure 2a). Moreover, PC1 kept the mutants *r* and *y\_r* separated from the other lines. PC2 placed *y\_r* and its parental lines in the upper side of the graph, together with SM. The corresponding loading plot was able to identify groups of metabolites, often belonging to the same metabolic pathways, as apocarotenoids (C, in red) and lipids (L, in light blue), terpenoids (T, in green blue) and branched-chain amino acid derivatives (BCAA, in blue; Figure 2b). A comparison between the score and the loading plots revealed the overall compositional differences between the mutated lines.

**Figure 2.** PCA of log2 values of 68 volatile compounds measured by a solid-phase micro-extraction gas-chromatography coupled to mass spectrometry (HS-SPME/GC-MS). (**a**) PC1 × PC2 score plot of the six mutated lines plus San Marzano (SM). (**b**) PC1 X PC2 loading plot. Line (a) and metabolite class (b) abbreviations are explained in Tables 1 and 2 respectively.

Indeed, one of the most obvious features was that all the lines harboring the *yellow flesh* mutation (*r*, *y\_r* and *r\_gf*) and, in a lesser extent, also the double mutant *y\_gf,* were characterized by producing lower levels of volatile apocarotenoids, which was particularly dramatic in the case of some linear apocarotenoids such as 6-methyl-5-hepten-2-one (Figure 3). In the case of the mutants for *yellow flesh* this was in accordance with the scarcity of their carotenoid precursors. These lines also showed lower levels of several phenylalanine derivatives. A characteristic feature of the lines *y*, *r* and *y\_r* was the high production of fatty acid derivatives together with low levels of branched-chain amino acid-related volatiles, conversely to double mutant lines *y\_gf* and *r\_gf*, which showed the opposite pattern. Finally, *y* and *gf* lines were characterized by higher levels of apocarotenoids and terpenoids (Figure 2b).


**Table 2.** Number of compounds in the different categories and classes of metabolites that are significantly different from San Marzano (SM) in each of the six lines under study.

\* Undefined compounds.

To further investigate the volatile compounds content of each of the mutated line in comparison with their original parental SM, a *t*-test analysis was performed. Out of 68 VOCs identified, the line with the highest number of compounds significantly different from SM was *y*, mainly because of differences in fatty acid derivatives, among other VOCs (Table 2). The lines containing the *r* mutation strongly differed for apocarotenoid volatiles, a group of metabolites considered to be involved in tomato flavor [37], representing the metabolic pathways most notably altered in these ILs (Figure 3). Some volatiles, such as the benzenoid eugenol, had higher levels in all mutant lines (Figure 3a). Mutants carrying *yellow flesh* had lower levels of the apocarotenoid β-ionone (Figure 3b), as well as 6-methyl-5-hepten-2-one, where *y\_gf* acquired lower levels too (Figure 3c). Furthermore, *r*, *y* and their double mutant had higher levels of many fatty acid derivatives, such as *(E)*-2-pentenal (Figure 3d) and 1-penten-3-one (Figure 3e). Lastly, *y* was enriched in phenylalanine derivatives, such as 2-phenylethanol (Figure 3f).

**Figure 3.** Relative log2 variation in selected volatiles involved in tomato flavor: (**a**) eugenol, (**b**) β-ionone, (**c**) 6-methyl-5-hepten-2-one, (**d**) *(E)*-2-pentenal, (**e**) 1-penten-3-one, (**f**) 2-phenylethanol in the six fruit mutant lines in San Marzano (SM) background compared with the recurrent parent. Line symbols are reported in Table 1. Bars colored in grey and black indicate means significantly lower and higher than SM for *p* ≤ 0.05 after Student's *t*-test, respectively.

#### *3.4. Targeted Analysis of Non-Polar Metabolites*

To investigate changes at the NP specialized metabolome, LC-HRMS was used to determine the level of 54 known and previously validated compounds. They were divided in different metabolic classes, including 14 fatty acids (FA), one phospholipid (PHO), two sterols (STE), 15 carotenoids (CAR), eight chlorophylls (CHL), six quinones (QUI), and five tocochromanols (TOC; Table S4). The score plot of the first two PCA components explained about the 57% of the total variance, with double mutants differently spaced from their parental lines (Figure 4a). Indeed, PC1 kept the mutants carrying *r* separated from SM and the other lines. PC2 clearly separated *y\_r* and its parental lines from the three mutants carrying *gf* plus SM, partially confirming untargeted metabolomics results. The loading plot grouped metabolites belonging to the same metabolic pathways (Figure 4b). A comparison between the score and the loading plots revealed the compositional differences between the mutated lines. Indeed, the chlorophyll group in the upper side of PC2 characterized the *green flesh* genotypes, while two quinones were in the lower side of PC2, in correspondence of the *yellow flesh* genotypes. The carotenoid group was split into early carotenoids (up to lycopene) and xanthophylls at the opposite sides of PC1; they were respectively decreased and increased in the *yellow flesh* mutant lines (Figure 4b).

**Figure 4.** PCA of log2 values of 54 NP (**a**,**b**) and 128 P (**c**,**d**) metabolites measured by LC-HRMS. (**a**–**c**) PC1 × PC2 score plot of the six mutated lines plus San Marzano (SM). (**b**–**d**) PC1 × PC2 corresponding loading plots. Line (**a**,**c**) and metabolite class (**b**,**d**) abbreviations are explained in Tables 1 and 2 respectively.

To further investigate the metabolic changes of each mutated line in comparison with the original parental SM, a *t*-test analysis was performed (Table 2; Table S4), and we particularly focused on metabolites with sensorial (color, taste) and health-related properties. The lines carrying *r* showed the highest number of NP compounds different from SM, mainly due, as expected, to differences in carotenoids (Table 2). The lower number of differences was shown by the single mutant *y*, indicating that this genotype is more similar to SM for NP targeted compounds.

*r*, *r\_gf* and *y\_r* were characterized by levels of phytoene (Figure 5a), β-carotene (Figure 5b) and lycopene (Figure 5c) lower than SM and the other lines, in agreement with previous reports; *r*, *r\_gf*, *y\_gf*, and *y\_r* reported higher levels of the xanthophylls all-trans-neoxanthin (Figure 5d) and luteoxanthin (Table S4). At chlorophyll metabolism level, *gf* was characterized by higher contents of both chlorophyll a (Table S4) and b, with the latter also higher in *y\_gf* and *r\_gf* (Figure 5e). Moreover, *r\_gf* showed higher levels of pheophytin a and pheophorbide a (Table S4). Drawing the attention on quinones, plastoquinone increased in lines carrying *r* (Figure 5f) and plastoquinol-9 in *r* and *y\_r* (Table S4). Lastly, δ-tocopherol amount was higher than SM only in *r\_gf*, while γ-tocopherol and β-tocopherol enhanced in *r\_gf* and *y\_gf* (Figure 5g,h).

**Figure 5.** Relative log2 variation in (**a**) phytoene, (**b**) β-carotene, (**c**) lycopene, (**d**) all-trans-neoxanthin, (**e**) chlorophyll b, (**f**) plastoquinone, (**g**) δ-tocopherol, (**h**) β-/γ-tocopherol, of six fruit mutant lines in San Marzano (SM) background compared with the recurrent parent. Line symbols are reported in Table 1. Bars colored in grey and black indicate means significantly lower and higher than SM for *p* ≤ 0.05 after Student's *t*-test, respectively.

#### *3.5. Targeted Analysis of Polar Metabolites*

The relative quantification of 128 polar metabolites allowed to complete the metabolomics characterization of the six mutants under study. P metabolites were divided into different metabolic classes, including 19 amino acids (AA), 17 acids (AC), four amines (AM), two lipids (LI), one nucleic acid (NU), 15 sugars and polyols (SAP), 11 alkaloids (ALK), 55 phenylpropanoids (PHE) and three vitamins (VIT; Table S5). The score plot of the first two components explained about the 52% of the total variance, with PC1 that particularly separated *r* and *r\_gf* from all the other lines, and PC2 that identified a group including *gf* and *r\_gf* together with SM (Figure 4c). Interestingly, P untargeted and targeted metabolomes differently separated the mutants under study, providing clues about a large extent of distinct metabolic components contributing to their chemical profiles. The loading plot grouped metabolites belonging to the same metabolic pathway (Figure 4d). By the comparison of the score and the loading plots, the position of some metabolites in relation to the lines studied was highlighted; indeed, many kaempferols and quercetins were in the PHE group corresponding to the area of the *y* mutants. On the contrary, many naringenins grouped in the opposite side. Most AAs were grouped together, matching with *r* and in opposition to the SAP group (Figure 4d).

To investigate the P metabolite content of each of the mutated lines in comparison with SM, a *t*-test analysis was performed (Table 2; Table S5), giving emphasis to nutritional- and sensorial attribute-related molecules. The line with the highest number of differentially accumulated polar

compounds was the *y* single mutant, with a preponderance of down-regulated metabolites, as expected; the PHE group mostly contributed to this diversity (Table 2). On the contrary, *gf* was the line more like SM. Notably, lines containing *r* showed a higher number of AA over SM (Table 2).

As already highlighted in the corresponding loading plot (Figure 4d), lines carrying *y* had lower levels of the PHE naringenin chalcone glucoside (Figure 6a), and conversely higher levels of kaempferol-hexose deoxyhexose-pentose compared to SM, a biochemical phenotype also observed in *gf* and *r\_gf* too for the latter (Figure 6b). Regarding the ALK group, calystegine B1 resulted statistically lower in all the lines, with exception of *gf* and *y\_gf*, which however displayed reduced amounts compared to SM (Figure 6c). In addition, a series of primary metabolites were characterized by higher levels in the mutants under study: for example, the AA proline in *r* and *y\_r* (Figure 6d) and the VIT nicotinamide in *gf*, *r* and *y\_r* (Figure 6e). Similarly, the SAP glucoheptulose was higher in *gf*, whereas all the other lines were more similar to SM (Figure 6f). The LI phosphocoline displayed lower levels in lines carrying the *y* mutation (Figure 6g), while the AC sinapinic acid was higher in all lines, with the only exception of *r* (Figure 6h).

**Figure 6.** Relative log2 variation in (**a**) naringenin chalcone glucoside, (**b**) kaempferol-hexose-deoxyhexose-pentose, (**c**) calystegine B1, (**d**) proline, (**e**) nicotinamide, (**f**) glucoheptulose, (**g**) phosphocoline, (**h**) sinapinic acid, of six fruit mutant lines in San Marzano (SM) background compared with the recurrent parent. Line symbols are reported in Table 1. Bars colored in grey and black indicate means significantly lower and higher than SM for *p* ≤ 0.05 after Student's *t*-test, respectively.

#### *3.6. Bioinformatics to Investigate Metabolite-Metabolite Relationships*

Bioinformatic approaches, including Venn diagram visualization and correlation network analysis, were used in order to achieve a deeper understanding of the biochemical perturbations and relationships occurring in the SM mutants under study. Venn diagrams showed the degree of overlap for VOC, NP and P metabolites in each double mutant and its two parental single mutants. For line-specific metabolites in each group, all lines, except *y\_gf* and *y\_r*, showed a higher number of significantly up-regulated metabolites (Figure 7). When the overlaps between single mutants and the respective double combinations were considered, the epistasis of *r* over *gf* (29 metabolites were in common between *r* and *r\_gf*) and *y* (28 metabolites in common between *r* and *y\_r*) and of *y* over *gf* (21 metabolites in common between *y* and *y\_gf*) were found (Figure 7).

**Figure 7.** Venn diagrams for (**a**) *gf*, *r* and *r\_gf*; (**b**) *gf*, *y* and *y\_gf*; (**c**) *r*, *y* and *y\_r*. Red and black numbers correspond to up- and down-regulated metabolites, respectively**.**

Furthermore, we used a correlation network approach to investigate mutation-induced alteration at VOC, NP and P metabolome levels. To this purpose, three networks were built by integrating all differentially accumulated metabolites in, at least, each single mutant and the corresponding double mutants (Figure 8). Overall, the three force-directed networks allowed the achievement of specific topologies according to the distribution of the significant correlation networks existing in each metabolite-metabolite interaction; notably, a high extent of conservation was observed in each network, either in the direct (PHE in the *y*- and CAR in the *r*-yielding networks) or not direct (L volatiles in *r, y* and *r\_y;* and AA in *r*, *gf* and *r\_gf*) targets of the mutations. In order to evidence the more robust and strongest correlations, the MCODE Cytoscape plugin was applied to each of the three networks (Figures S2–S4). In this way, it was possible to identify the highly interconnected regions, which resulted particularly abundant in the *r*/*gf* mutants, with a lower number of greatly dense clusters including primary (SAP, AA) and secondary (isoprenoids as CAR, CHL, QUI, T and PHE) compounds; on the contrary, the *y*/*r* mutants were characterized by a higher number of clusters with lower density, and with a high representation of amino and organic acids, besides CAR and PHE.

**Figure 8.** Correlation networks of metabolomics data in (**a**) *y, gf* and *y\_gf*; (**b**) *y, r* and *y\_r*; (**c**) *r, gf* and *r\_gf* mutants. All the data from volatile (VOC), non-polar (NP) and polar (P) analyses were used as fold change to the San Marzano (SM) reference. Network topology was directed by the force of Pearson correlation coefficient index (Table S6). Each node represents VOC (turquoise circle), a NP (diamond) or a P (triangle) metabolite. Lines joining the nodes represent positive (red) or negative (blue) correlations. Node sizes are proportional to the respective node strengths, which are shown in Tables S7–S9. Node color is depending on the metabolic class of each compound as indicated in the figure. Number of nodes (n) and network strength (NS) are shown at the top of each network. Only correlations with |ρ| > 0.95 are shown (*p* ≤ 0.05).

#### **4. Discussion**

3HQWDGHFDQRLFDFLG

This study focused on the analysis of the volatile and non-volatile compounds of six tomato lines introgressing mutations that have been selected as representative for affecting the main classes of fruit pigments. Interest towards these mutations is proved by the increasing introduction of cultivars with yellow, pink and brown fruits in the fresh tomato market and by scientific studies addressing the properties of single [38,39] and multiple [40] mutants. The characterization of the biochemical effects of these mutations was based on the comparison with the SM original parent, with the final aim of defining whether the lines showing aesthetic novelty were also diversified for organoleptic and nutritional characteristics.

When the studied compounds were considered all together, *y* showed the highest level of variation. Venn diagrams indicated epistasis of *r* over both *gf* and *y*, as highlighted by the univariate analysis. Interestingly, when integrated and subjected to a correlation network analysis, differentially accumulated metabolites in any of the six single and double mutant ILs exhibited a strong level of coordination: indeed, irrespectively to the metabolic class object of the mutation, metabolites acting in the same pathway clustered together, thus indicating a great conservative capacity of the fruit metabolism in its mutation-derived reorganization. This finding is consistent with previous reports showing, either in tomato and grape, a general phenotypic-metabolic plasticity in response to genetic or environmental changes [24,30,41]. However, looking at the sub-clusters generated by each network, distinct and specific relationships were unraveled, with the mutants carrying the *y* and *r* mutations involving the largest number of metabolites belonging to highly diversified pathways.

#### *4.1. Biochemical Changes in Fruits of Yellow Flesh and of Its Combinations with y and gf*

Mutants containing *r* were clearly separated from the other lines by NP targeted metabolites, mainly due to the compositional differences of carotenoids. The double mutant *r\_gf* mapped in a distinct position compared with *r* and *y\_r*, indicating partially additive and partially synergistic effects of this combination. The primary effect of the *r* mutation in the carotenoid pathway was to strongly reduce phytoene, and the colored carotenoids β-carotene and lycopene, as expected [42]. Phytofluene was also reduced in *r* genotypes, as previously described in tomato accessions [43] and in ripening fruits of *Psy1* knockout lines [44]. In the latter case, a concomitant decrease in the volatile apocarotenoids was also observed [44]. In literature, the study of *r* metabolites was often limited to the analysis of carotenoids, including the xanthophyll lutein, whose levels, in line with our results, did not vary [11]. Intriguingly, we evidenced that all-trans-neoxanthin, the last xanthophylls in the carotenoid pathway, in *r* mutants showed levels higher than SM; it is possible that this represents a cellular strategy implemented to compensate for the decrease of other carotenoids. Indeed, xanthophylls act in flowing the energy through the photosynthetic apparatus and protecting organisms against damage caused by photosynthesis itself [45]. In addition, both violaxanthin and neoxanthin are key substrates in ABA biosynthesis, and their variation can influence fruit ripening and the attitude towards abiotic stresses [46]. In this context, enhanced levels of xanthophylls could provide enough metabolic flux to guarantee an adequate ABA production. The same hypothesis can be extended to the analysis of quinones, such as plastoquinone and plastoquinol-9, whose levels increased in *yellow flesh* mutants. The importance of quinones in basic metabolic processes, such as respiration and photosynthesis, has been established [47] and an increase in their content is plausible, in a scenario where it is necessary to compensate for the lack of carotenoids.

Important metabolic changes occurred in the *r* mutants for P compounds; several amino acids positively varied, as proline in the single mutant and in combination with *y*, and valine/norvaline in the single mutant or combined with *gf*. Proline and valine have significant functions in plant cells as plant stress sensor, ABA and polyamines interactors, and precursors of BCAA-derived volatiles [48,49]. Similarly, the vitamin nicotinamide, that plays a primary role in pyridine metabolism [50], also increased in *r*, alone or with *y*.

Carotenoid-derived volatiles have an important role in tomato flavor, as their levels positively correlate with tomato acceptability [51]. Loss-of-function of *Psy1* in *yellow flesh* mutants led to the lack of substrates for apocarotenoid production [52], thus justifying the low amount of C VOCs such as β-ionone [53] and 6-methyl-5-hepten-2-one [54]. Conversely, the L 1-penten-3-one and *(E)*-2-pentenal increased in *r* and *r\_y*, suggesting a role for the lipoxygenase–linoleate (LOX) enzyme that catalyzes the oxidation of polyunsaturated fatty acids by molecular oxygen with the formation of unstable hydroperoxides which in turn oxidize carotenoid pigments [55].

In summary, compensating the depletion of the principal carotenoids, lines carrying the *r* mutation offer peculiar nutraceutical properties, for an increased content of amino acids, vitamins, xanthophylls and quinones. Xanthophylls have been highlighted in recent studies for their positive contribution towards total dietary carotenoid intake [56], whereas quinones showed positive properties in treating cardiovascular diseases, chronic gingivitis and periodontitis and a favorable impact on cancer treatment and human reproductive health [57]. Differently, tomato *r* lines will probably show a lower score for the aroma, considering that some of the apocarotenoids providing floral or fruity notes are less represented in yellow fruits.

#### *4.2. Biochemical Changes in Fruits of Colorless Fruit Epidermis and of Its Combinations with r and gf*

Multivariate analysis of targeted P compounds tightly grouped lines carrying the *y* mutation, an expected result as *y* is characterized by substantial modifications of the class of phenylpropanoids, mainly due to the lack of yellow pigment naringenin chalcone [15]. As also observed in the case of the NP fraction, untargeted and targeted metabolomes did not perfectly matched, suggesting the existence of larger, still unexplored, metabolic changes. The decrease or absence of naringenins also characterized *y* and its two double mutants, leading to the conclusion that *y* is epistatic on *gf* and *r* for this class of compounds. Compared to red SM fruits, kaempferol-hexose-deoxyhexose-pentose was higher in *y* mutants, even if many other kaempferols showed no differences. Similarly, several quercetins showed higher levels in lines carrying *y*. Variations in the PHE pathways could explain the reduced phosphocholine levels detected in the three *y* mutants, since increased activity of pathways generating cuticular lipids in tomato fruit peel precedes that of phenylpropanoid and flavonoid biosynthesis pathways [58].

No substantial differences in carotenoids were found between *y* and SM, except the higher all-trans-neoxanthin levels in *y\_gf* and *y\_r*. Conversely, tocochromanol metabolism was altered in *y* mutant berries as there was a decrease in tocopherols, mainly affecting *y* and *y\_r*, confirming previous reports [15,59].

Among VOCs, differences were found in compounds derived from phenylalanine, such as eugenol and 2-phenylethanol, in line with the known alterations of *y* in the phenylpropanoid branch of phenylalanine catabolism [54]. Indeed, a significant number of VOCs derived from amino acids, are considered to have an influence on tomato flavor and liking, such as guaiacol and eugenol. In addition, 2-phenylethanol is a main contributor to tomato flavor, increasing floral aroma and the perception of sweetness [60,61], it is plausible that increase of these class of volatiles exerts a positive effect on the aroma in lines carrying the *y* mutation. In addition, *y* mutants were characterized by higher production of many fatty acid derivatives, which are the most abundant volatiles produced in the tomato fruit [62]. These VOCs include several C5 compounds such as 1-penten-3-one or (E)-2-pentenal, and C6 volatiles such as 1-hexanol, (*Z*)-3-hexenal, (*E*)-2-hexenal, or hexanal, among others, that are classified as "green leaf" volatiles due to their 'green' characteristic, with a fresh aroma of cut grass. Although some studies suggested a reduced impact on tomato flavor and no effect on consumer preference of these compounds [63,64], others claim their potential impact on overall flavor intensity and liking [37].

In addition to variation of the aroma, *y* mutants are endowed of higher kaempferol and quercetin levels, but also present the drawback of lower naringenin and tocopherols. Since all tocochromanols are potent lipid-soluble antioxidants and are essential dietary nutrients for mammals as vitamin E [65], this aspect represents a negative trait in *y* tomato lines.

Therefore, the *y* mutation is to be considered as a source of more aesthetic (color) and organoleptic (aroma) than nutritional novelty. Although no clear relationship between colorless epidermis and fruit shelf-life, peculiar mechanical properties of the *y* epicarp were manifested by the fact that the peel of the mutant fruit was richer of lignin [15,66]. Intriguingly, the SM lines carrying *y*, alone or in combination, showed a higher resistance to storage, indicating that pigment variation in the peel implicates different mechanical properties and post-harvest behavior of the fruit [26].

#### *4.3. Biochemical Changes in Fruits of Green Flesh and of Its Combinations with r and y*

The group of *green flesh* mutants showed no changes in carotenoids, but as expected, it was primarily affected in the content of chlorophylls. *gf* belongs to the so-called cosmetic-type SGR mutants, that have been described in several plant species to retain substantial amount of chlorophylls in fruits during ripening, maintaining other ripening-related metabolites, such as lycopene, unchanged [67]. Accordingly, all the SM lines containing *gf* showed fruits with higher amounts of chlorophylls. Among other NP metabolites, several tocopherols increased in *gf* single and double mutants, confirming previous data that showed higher tocopherol levels in *gf* [68,69] and *r\_gf* (A. Mazzucato and G.P. Soressi, unpublished results) mutants.

For P metabolites, *gf* was similar to SM, with few notable exceptions involving flavonoids, such as kaempferols, and sugars, such as glucoheptulose.

No substantial variations were found for *gf* mutants in VOCs except for a high eugenol content that was paralleled by the other lines.

Because of the positive effects of chlorophylls and chlorophyll-related metabolites on cellular inflammation and as anti-mutagen and anti-carcinogen agent [70,71], and of the antioxidant function of vitamin E-producing metabolites, that reduce free-radical damage to membrane lipids by scavenging peroxyl radicals [72,73], tomato lines carrying *gf* may be considered an alternative to red tomatoes offering added nutraceutical value. In addition, *gf* may play important roles in enhancing organoleptic qualities. Although a direct, positive effect of *gf* on soluble solids content has not clearly been established, an SlGLK2-enhanced chlorophyll content in immature green fruits led to an increment in total soluble solid content in ripe fruits, possibly by the positive regulation of sugar metabolism enzyme-encoding genes [69]. This effect will be additive to that of other taste-related compounds, such as glutamate [20]. In our *gf* lines, glutamic acid was not significantly different from SM, but, considering only data in one year, it was substantially higher in all lines. All these properties justify the interest towards breeding new *gf* tomato lines and the present success in the market promises a further development in the near future.

#### **5. Conclusions**

This work was based on the study of the effects of mutations with an emerging commercial interest, compared within the traditional Italian tomato variety San Marzano. This biochemical and bioinformatics characterization has given further insights on the effect of each mutation on fruit aesthetic, flavor and nutritional composition. The analysis of the respective three double mutants offered an added value, making it possible to establish epistatic or synergistic effects between each pair of mutations and represented a starting point for breeding new tomato lines with different phenotypes. Although *r* and *y* cause the decrement of the most important classes of health-related pigments (carotenoids and flavonoids), the compensating increase of other metabolites with nutraceutical (xanthophylls, tocopherols, amino acids) or flavor-related (phenylalanine and fatty acid derivatives) positive properties make the studied lines worthy of attention for breeding novel and better tomatoes. Novel phenotypes could take advantage of new breeding techniques as genome editing to recapitulate the original mutations in different tomato backgrounds, or in different species, and obtain the wanted variation directly avoiding backcrossing and linkage drag effects. Ultimately, advanced breeding programs will convert the new lines into novel elite varieties of commercial interest.

**Supplementary Materials:** The following are available online at http://www.mdpi.com/2218-1989/10/3/110/s1, Figure S1: PC1 × PC2 score plots of the six mutated lines plus San Marzano (SM), sampled in two different years, according to relative values of 263 volatile (**a**), 746 non-polar (**b**) and 110 polar (**c**) metabolites measured by a solid-phase micro-extraction gas-chromatography coupled with a mass spectrometry instrument (HS-SPME-GC-MS), Figure S2: Subnetworks generated from the *y, gf* and *y\_gf* global network of Figure 8, using the MCODE Cytoscape plugin. Each subnetwork highlights the highest densely connected regions present in the global network. Each node represents a VOC (turquoise circle), non-polar (diamond) or a polar metabolite (triangle). Lines joining the nodes represent correlations; direct and inverse correlations are shown in red and blue, respectively. Node sizes are proportional to the corresponding node strengths (nss), which are shown in

Supplemental Table S7. Node color is depending on the metabolic class of each compound as indicated in the figure. Number of nodes (n) and network strength (NS) are shown on top of each network. Only correlations with |ρ| > 0.95 are shown (*p*-value 0.05), Figure S3: Subnetworks generated from the *y, r* and *y\_r* global network of Figure 8 using the MCODE Cytoscape plugin. Each subnetwork highlights the highest densely connected regions present in the global network. Each node represents a VOC (turquoise circle), non-polar (diamond) or a polar metabolite (triangle). Lines joining the nodes represent correlations; direct and inverse correlations are shown in red and blue, respectively. Node sizes are proportional to the corresponding node strengths (nss), which are shown in Supplemental Table S7. Node color is depending on the metabolic class of each compound as indicated in the figure. Number of nodes (n) and network strength (NS) are shown on top of each network. Only correlations with |ρ| > 0.95 are shown (*p*-value 0.05), Figure S4: Subnetworks generated from the *r, gf* and *r\_gf* global network of Figure 8 using the MCODE Cytoscape plugin. Each subnetwork highlights the highest densely connected regions present in the global network. Each node represents a VOC (turquoise circle), non-polar (diamond) or a polar metabolite (triangle). Lines joining the nodes represent correlations; direct and inverse correlations are shown in red and blue, respectively. Node sizes are proportional to the corresponding node strengths (nss), which are shown in Supplemental Table S7. Node color is depending on the metabolic class of each compound as indicated in the figure. Number of nodes (n) and network strength (NS) are shown on top of each network. Only correlations with |ρ| > 0.95 are shown (*p*-value 0.05), Table S1: Number of volatile (VOC), polar (P) and non-polar (NP) untargeted metabolites significantly different from Sam Marzano (SM) in the six introgression lines. Line symbols are explained in Table 1, Table S2: Fisher's F test (F) and relative degree of significance (P) after two-way multivariate analysis of variance (MANOVA) for three main categories of metabolites analyzed, Table S3: Mean (M) and standard deviation (SD) of volatile compounds in two years for the fold change of individual values over the San Marzano value and significance of Student's t test between each mutant line and San Marzano. Line symbols are explained in Table 1, Table S4: Mean (M) and standard deviation (SD) of non-polar compounds in two years for the fold change of individual values over the San Marzano value and significance of Student's t test between each mutant line and San Marzano. Line symbols are explained in Table 1, Table S5: Mean (M) and standard deviation (SD) of polar compounds in two years for the fold change of individual values over the San Marzano value and significance of Student's t test between each mutant line and San Marzano. Line symbols are explained in Table 1, Table S6: Metabolite-metabolite significant correlations (Pearson coefficient; r > |0.95|, *p*-value < 0.05) in tomato single mutant and the corresponding double mutants (*r*, *y* and *r\_y*; *r*, *gf* and *r\_gf*; *y*, *gf* and *y\_gf*), Table S7: Node table of y, *gf* and *y\_gf* double mutant Network. For each node the corresponding node strength (ns) is reported, Table S8: Node table of *y*, r and *y\_r* double mutant Network. For each node the corresponding node strength (ns) is reported, Table S9: Node table of gf, r and *r\_gf* double mutant Network. For each node the corresponding node strength (ns) is reported.

**Author Contributions:** A.M., G.D. (Gianfranco Diretto) and A.G. conceptualized and designed the experiments, G.D. (Gabriella Dono) G.D. (Gianfranco Diretto) and J.L.R. performed the experiments, G.D. (Gabriella Dono), G.D. (Gianfranco Diretto), S.F. and J.L.R. performed statistical analysis, A.M. and G.D. (Gabriella Dono) drafted the manuscript. All the authors reviewed and edited the manuscript. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported by the European Commission through-H2020 SFS-7a-2014 TRADITOM (634561) and the COST Actions ROXy CA18210 and EUcarotene CA15136 for networking. The Italian Ministry for Education, University and Research (MIUR) Law 232/216, Department of Excellence and the Spanish Ministry project BIO2016-78601-R are also acknowledged.

**Acknowledgments:** The authors acknowledge M.E. Picarella and the Metabolomics Lab at IBMCP for expert technical assistance.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Placenta, Pericarp, and Seeds of Tabasco Chili Pepper Fruits Show a Contrasting Diversity of Bioactive Metabolites**

#### **Felipe Cervantes-Hernández, Paul Alcalá-González, Octavio Martínez and José Juan Ordaz-Ortiz \***

Unidad de Genómica Avanzada, Centro de Investigación y de Estudios Avanzados del Instituto Politécnico Nacional (CINVESTAV), Km. 9.6, Libramiento Norte Carretera Irapuato-León, Irapuato, Gto. 36824, Mexico; felipe.cervantes@cinvestav.mx (F.C.-H.); paulpascual92@hotmail.com (P.A.-G.); octavio.martinez@cinvestav.mx (O.M.)

**\*** Correspondence: jose.ordaz.ortiz@cinvestav.mx

Received: 26 August 2019; Accepted: 23 September 2019; Published: 28 September 2019

**Abstract:** Chili pepper (*Capsicum* spp.) is one of the most important horticultural crops worldwide, and its unique organoleptic properties and health benefits have been established for centuries. However, there is little knowledge about how metabolites are distributed throughout fruit parts. This work focuses on the use of liquid chromatography coupled with high resolution mass spectrometry (UHPLC-ESI-HRMS) to estimate the global metabolite profiles of the pericarp, placenta, and seeds of Tabasco pepper fruits (*Capsicum frutescens* L.) at the red mature stage of ripening. Our main results putatively identified 60 differential compounds between these tissues and seeds. Firstly, we found that pericarp has a higher content of glycosides, showing on average a fold change of 5 and a fold change of 14 for terpenoids when compared with other parts of the fruit. While placenta was the richest tissue in capsaicinoid-related compounds, alkaloids, and tocopherols, with a 35, 3, and 7 fold change, respectively. However, the seeds were richer in fatty acids and saponins with fold changes of 86 and 224, respectively. Therefore, our study demonstrates that a non-targeted metabolomic approach may help to improve our understanding of unexplored areas of plant metabolism and also may be the starting point for a detailed analysis in complex plant parts, such as fruits.

**Keywords:** *Capsicum frutescens* L.; non-targeted metabolomics; secondary metabolism; Liquid Chromatography coupled to Mass Spectrometry (LC-MS)

#### **1. Introduction**

Chili pepper (*Capsicum* spp.) is one of the most important crops worldwide. It is used as a main ingredient for many dishes in different cultures, such as Asian, Latin-American, and Mediterranean cultures, due to its organoleptic properties [1]. There are 40 accepted chili species but only five are considered domesticated: *C. annuum*, *C. chinense* Jacq, *C. frutescens*, *C. baccatum*, and *C. pubescens* [2]. In 2017, 36 million tons of chili peppers were produced globally, with Mexico being the second largest producer [3]. Wild pepper populations of Tabasco pepper (*C. frutescens* L.) and *C. annuum* can be found in some states of Mexico, increasing the relevance for collecting and characterizing these species as resources for the breeding of cultivated peppers [4,5]. Previous studies have extensively described metabolite diversity in *C. annuum* [6,7] but very little is known on *C. chinense* and *C. frutescens*. Most of these studies have undertaken a targeted approach, where the main focus has been to quantitate for specific metabolites, such as capsiate, dihydrocapsiate, capsaicin, dihydrocapsaicin, carotenoids, fatty acids, and amino acids [8,9]. To date, there are no studies with a comprehensive global profiling of tissue specific in *C. frutescens* and *C. chinense*.

*Capsicum* species are known to be rich in compounds such as capsaicinoids, capsinoids, carotenoids, flavonoids, vitamins, essential oils, and other phytochemicals, which provide a unique taste, aromatic properties, and health benefits [10–14]. Capsaicinoids consist mainly of two congeners, capsaicin and dihydrocapsaicin. Capsinoids also have two major analogues, capsiate and dihydrocapsiate. However, there are structural differences: Capsaicinoids are fatty acid amides linked with vanillylamine, whereas capsinoids are fatty acid esters linked to vanillyl alcohol [15,16]. Several compounds identified in the *Capsicum* species have been studied, because the medicinal potential examples of these compounds are icariside E5, capsosides, and capsianosides [17]; hydroxycinnamic derivatives, O-glycosides of quercetin, luteolin, and chrysoeriol [7]. The reported medicinal benefits are related to areas, such as anti-inflammatory [18], anti-cancer [19], anti-microbial, antioxidant properties [1,20,21], and those with weight-loss properties [22]. In addition, some epidemiological studies of a number of these antioxidants reported that they possess anti-atherosclerotic, antitumor, antimutagenic or anticarcinogenic activity [23–25]. It may be that these properties have the ability to help us to address the identification, isolation, and production of nutraceutical compounds or new natural medicinal compounds.

Nevertheless, the location and relative abundance of these metabolites and their precursors in different parts of the chili pepper fruit, such as the pericarp, placenta and seed, remain unclear. It is known that some compounds are synthesized and accumulated into specific tissues in the *Capsicum* genus [1,26]; for example, capsaicin is synthesized mainly in the placenta [27,28], while anthocyanins are described as being accumulated in pericarp during fruit development [29]. Materska, reported a placental and pericarp comparison in chili pepper fruit, where placenta was the richest in flavonoids, while the pericarp presented a larger diversity in glycosylated compounds. Despite this, little is known about the tissue specific spatial-temporal location of other classes of compounds and products of the secondary metabolism in *Capsicum* fruits [30]. Investigating the metabolic diversity on fruit tissues is essential in order to gain a comprehensive understanding of the function of specific parts of the fruit at the fundamental level. Consequently, this will enable the possible exploitation of natural products either for pharmaceutical or food products. In the past, these have been done with histochemical methods, by staining tissues sections with various chemical to reveal the presence of specific compounds either by visual or microscope inspection [31].

Metabolomics is defined as the comprehensive analysis of all low molecular weight organic compounds (<1500 Da) in a biological system [32]. Mass spectrometry has become the most widely applied platform for metabolomics, due to the wide range of molecules that can be analyzed on a single run [33]. Global profiling or non-targeted mass spectrometry-based metabolomics have gained importance in the study of crop species and have been applied to investigate potato, tomato, rice, wheat, strawberry, cucumber, and tobacco [34–37]. In the field of plant metabolomics, liquid chromatography coupled with electrospray ionization high resolution mass spectrometry (UPLC-ESI-HRMS) has emerged as the technique of choice for the putative identification of metabolites in complex matrices. This technique has been widely used, due to its sensitivity, selectivity, and analysis capability [38,39]. Nevertheless, metabolite identification for unknown compounds still remains a big challenge to overcome. In that respect, the recommendations by the Metabolomics Standards Initiative (MSI) recognize five different levels for metabolite confidence annotations. Level 0 requires the full compound 3D structure and stereochemistry information. Levels which are more common include: Level 1 identifications need a confirmation by two orthogonal parameters such as retention time and MS/MS spectrum, normally with match reference standards; and Level 2 requires at least two orthogonal pieces of information, including evidence that excludes all other candidates. Data for Level 2 should describe probable structure and be matched to literature data or databases by diagnostic evidence [40].

Combining existing bioinformatic tools with high resolution mass spectrometry data can reveal unclear relationships of metabolites and their possible function at a spatial-temporal distribution level. As a first attempt to construct the chili fruit metabolome, we produced a hand curated dataset,

that contains 60 putative identified metabolites, which include alkaloids and terpenoids that are unreported in *Capsicum*, with significant differences and relative abundances between three sections (pericarp, placenta, and seed) of Tabasco pepper fruit at the mature red stage, using an UHPLC-QTOF-HRMS platform, combined with the use of Progenesis QI for small molecules, as a tool for the pre-identification of unknown metabolites. Our results underline the global metabolic differences in complexity, mainly based on the secondary metabolism of these fruit parts.

#### **2. Results**

#### *2.1. Non-Targeted Metabolomic Analysis*

Our data for two tissues and the seed of Tabasco chili fruit comprised a total of 87 files or chromatograms per ionization mode (positive and negative polarities) as shown in Figure 1 and were uploaded into Progenesis QI. The dataset was first aligned (retention time): Each chromatogram was aligned against each other and automatically compared to a reference profile selected by the software that contained the highest number of features (potential compounds). Then, peak picking was performed by default parameters. A total of 1980 features were detected in the aqueous phase and 1481 were detected in the diethyl ether extract, for both ionization modes. Figure 1 shows a typical chromatographic profile in positive and negative ionization mode of placenta tissue, displaying some representative compounds of chili pepper fruit.

**Figure 1.** (**a**) Base peak intensity chromatographic profile of placenta tissue from chili pepper fruit on a Charged Surface Hybrid (CSH) C18 column obtained with Electrospray Ionization (ESI) positive mode on a mass range from 100 to 1500. 1. 2-[(1H-Indol-3-ylacetyl) amino]-4-methylpentanoate; 2. Indole-3-acetamide; 3. L-cis-Cyclo (aspartylphenylalanyl); 4. Capsicosin; 5. Yamogenintetroside B; 6. Capsaicin; 7. Dihydrocapsaicin; 8. Homodihydrocapsaicin; 9. β-Carotinal. (**b**) Base peak intensity chromatographic profile of placenta tissue from chili pepper fruit on a CSH C18 column obtained with ESI negative mode. 10. Oleandrigenin monodigitoxoside; 11. β-d-fructofuranosyl 6-O-octanoyl-α-d-glucopyranoside; 12. β-(1->6)-galactotriitol; 13. (2S,3R)-2-Azaniumyl-3-hydroxyoctadecyl phosphate.

Distribution of the features between parts was compared using principal component analysis (PCA) of loading and score plots describing a significant grouping by part for both extraction phases (Figure 2). In addition, quality controls (QC samples) were also considered, as shown in Figure 2. The QC samples cluster tightly in comparison to the total variance in the projection, suggesting a dataset deemed to be of high quality. Tables 1 and 2 describe loadings that mostly contribute to principal components for each extraction. Putative organic compounds catalogued as saponins (SPNS), such as Tuberoside J, Matesaponin 5, and Asparagoside B significantly contributed to component 1 in the aqueous phase, while other important features for component 2 belong to flavonoid class (FLV) and SPNS compounds. Furthermore, both components in the organic phase were mainly composed of glycerolipids (GL) and terpenoids (TER), as well as a putative carotenoid, (5cis,5 cis,9cis,11 cis)-1,2,7,7 ,8,8 -Hexahydro-1,2-epoxy-ψ, ψ-carotene were important for contribution of component 1 in organic phase.

**Figure 2.** (**a**) Principal Component Analysis (PCA) Bi-plot of loadings (features: crosses) obtained in ESI positive mode and scores (samples: colored circles) extracted with methanol:water phase (component 1:33.10%; component 2: 15.72%; loadings = 1294 features; n = 26); (**b**) PCA Bi-plot of loadings (features: crosses) and scores (samples: colored circles) extracted with diethyl ether phase (component 1:37.15%; component 2: 15.14%; loadings = 1391 features; n = 24).


**Table 1.** Loadings most contributing to principal components for the aqueous phase.

PC1: Principal Component 1; PC2: Principal Component 2; BZD: benzoyl derivate; FLV: flavonoid; SPNS: saponin.



PC1: Principal Component 1; PC2: principal Component 2; BZD: benzoyl derivate; Caro: carotenoid; GL: Glycerolipids; SPNS: saponin; TER: terpenoid.

#### *2.2. Level 1 and 2 Metabolomic Identification Analysis*

We putatively identified approximately 270 compounds and classified them in 52 compound classes (Table S1). Putative identifications were taken into consideration with a high match score (>90%). Terpenoids, fatty acids, and glycosylated compounds were the most abundant groups. As was predicted, different capsaicinoid compounds were also detected with a high match score. Alkaloids, carotenoids, saponins, and phenolic compounds were also detected in our study.

Capsaicin and dihydrocapsaicin were validated by matching their retention times and MS/MS spectra with those of the analytical standard (Level 1 identification). Furthermore, commonly reported compounds, such as carotenoids and capsaicin related compounds, were detected and putatively identified in the same manner in our samples (Figure 3).

**Figure 3.** Mass spectrum of most common compounds in Tabasco chili pepper using Ultra High Pressure Liquid Chromatography MSMS Quadrupole Time of Flight (UHPLC-MS2 Q-TOF; collision energy ramp: 20–40 eV) ESI positive ionization mode of placenta tissue, as putatively identified by Progenesis QI for small molecules. (**a**) Capsaicin; (**b**) β-carotinal; and (**c**) dihydrocapsaicin.

#### *2.3. Di*ff*erent Metabolomic Profiles in Capsicum Sections*

Based on results from volcano plot comparisons, we found a total of 60 putative compounds with significant differences between the parts of the Tabasco pepper fruit. Figure 4 shows the distribution of features between placenta and pericarp tissues. As was expected [28], capsaicinoids were more abundant in placenta than pericarp. In contrast, pericarp was richer in glycosylated compounds and terpenoids such as acalyphin, capsiate, and capsidiol. Some significant ions remain unknown that still need to be identified.

**Figure 4.** Volcano plot comparison of relative abundance between tissues of 1394 features in ESI positive ionization mode: Placenta (left), and pericarp (right), unchanged (green); one-way ANOVA *p* = 0.05 (dotted line) Y axis: p value, X axis: fold change.

After feature screening and putative identification, Venn diagrams were generated (Figure 5) to show similarities and differences between fruit parts, according to the fold change values obtained in Table S1. Noticeably, around 30% of the complete dataset of identification was shared by all three fruit parts in almost all the extraction solvents, the exception being for the organic phase in the negative ionization mode, which only share 9.32% similarities. As can be seen in the Venn diagrams, a greater number of putative metabolites were identified in the positive ionization mode. Those easily detected in the positive ionization mode were the compound classes alkaloids, carotenoids, fatty acids, glycosylated compounds, terpenoids, and saponins; while in the negative ionization mode amino acid-derivate compounds, sphingolipids, and phospholipids were detected.

**Figure 5.** Venn diagrams of the complete dataset of putative metabolites in different fruit parts at the red mature stage of Tabasco chili pepper (*C. frutescens*); labels are in percent and number of metabolites.

Listed in Table 3 is the putative identification that presents a differential abundance between fruit parts, including the fragmentation pattern (MS2) and adducts for putative annotation. Fold change values are shown in Table S1.





AK: Alkaloids; BZD: benzoyl derivate; CAPS: capsaicinoids; CBN: carbonyl derivate; FAT: fatty acids; FLV: flavonoids; GC: glycoside compounds; Jasm: jasmones; PPN: phenylpropanoids; PYR: pyrazines; SPNS: saponins; TER: terpenoids; TPHE: tocopherols. Relative abundance values of fragments ions are in brackets.

#### **3. Discussion**

The global metabolic comparison between the tissues and seeds of *C. frutescens* showed several feature differences between the pericarp, placenta, and seed. The Level 1 and 2 confidence metabolite annotations allowed us to assign a putative identification to these ions. Around 30% of metabolites were shared between all three parts. Compounds related to the primary metabolism showed few significant differences, they included amino acid related compounds, fatty acids, and phospholipids. As shown in the Venn diagram (Figure 5) and Table S1, placenta and pericarp have the biggest compound class diversity. Significantly, the seeds presented a higher number of putative identifications, and these were primarily saponins, terpenes, and fatty acids.

Pericarp compound classes were mainly composed of glycosylated compounds and terpenoids. Complementary to these findings, Materska demonstrated that chili pepper pericarp is abundant in glycosylated compounds [30]. Likewise, terpenoids were distributed in the whole fruit but pericarp showed a slightly higher proportion of them. These compounds are highly abundant in spices and herbs and give a wide range on the aroma and flavor spectrum [1]. Similarly, terpenoids have been described as showing antibiotic properties [41] and have been used in fragrances [42].

Placental tissue showed a large number of previously reported compounds with bioactivity, mainly capsaicin- and capsinoids-related compounds. In addition, alkaloids and tocopherols were present, a fact that is in agreement with current literature [28,30]. Found to be abundant in this fruit compartment were 6-O-acetylaustroinulin (terpenoid) and Myrciacitrin V (flavonoid) and they have not been reported in the *Capsicum* fruit.

The compounds found in chili seeds were predominantly fatty acids (3,16-Dihydroxypalmitate, Sterculynic acid) and saponins (Capsicoside A, Eleutheroside L and Tragopogonsaponin F) where they function as reserve nutrients for embryo development and propagation [43]. Moreover, seeds showed the presence of terpenes (Ursolic acid 3-[glucosyl-(1->4)-xyloside]) which are known to function as a natural promoter of predation and, as a consequence, a seed disperser [44]. Ritota et al. (2010) reported an abundance of fatty acids in sweet pepper species by nuclear magnetic resonance spectroscopy, in which polyunsaturated fatty acids were easily detected and pre-identified [6].

Our results were consistent with previously reported findings regarding the large diversity of secondary metabolites in fruits of *Capsicum* species and the non-targeted metabolomics profiling of Solanaceae [10,26,39,45–48]. Furthermore, new compounds, such as Myrciacitrin V, Feruloyl-β-sitosterol, 6-O-acetylaustroinulin and others, were putatively annotated as statistically significant in specific fruit parts.

Capsaicin, Dihydrocapsaicin, and capsaicinoids derivatives mainly accumulate in the placenta, as previously reported [27,49]. This class of compounds represents the most described and abundant metabolite in this genus and are predominantly known as being responsible for the pungency. Different bioactivity assays have been developed, demonstrating properties of capsaicinoids over different cell lines and metabolism, including as an analgesic and for weight-loss [22,50]. Large abundance of capsaicinoids in the chili fruit placenta was proposed by Tewksbury and Nabhan (2001), who suggest that capsaicin selectively discourages vertebrate predators (capsaicin has been found to repel or poison mammals) without deterring more effective and important seed dispersers, such as birds. [51].

A variety of new compounds in *Capsicum* genus, also reported in different species, were detected in pericarp, including isopimaric acid [41], which is a terpenoid with bioactive properties. Additionally, other compounds such as Pedalitin [52], Xi-8-acetonyldihydrosanguinarine [53], Pratenol B [54], Uralenneoside [55] were detected in pericarp and have been previously reported as bioactive compounds. Quercetin 3-(6"-malonyl-glucoside) is an anthocyanin-related compound that has not been reported in pepper fruit, but this compound class is well known to be localized mainly in pericarp [56,57], due to its involvement as a protection system against solar damage in plants and to attract potential pollinators [29].

New putative compounds in placental tissue, such as Lycopodane [58], 2,4- Pentadiynylbenzene [59], Myrciacitrin V [60], Cinncassiol C [61], and 6-O-acetylaustroinulin [62]

have been reported as bioactive compounds, supporting the nutraceutical properties of chili pepper against metabolic disorders.

The existence of terpenes in seeds may result in different aromas that have been shown to firstly attract birds to mature fruits during the day [63] and secondly, to promote the dispersal of seeds. This function supports the ecological relationship between birds and chili pepper fruits, attracting the most beneficial vertebrate predators [51].

In summary, the non-targeted LC-MS metabolomics method that was developed in this study is shown to be a powerful tool for the putative identification of tissue-specific secondary metabolites at the red mature stage of chili pepper fruit. The use of databases available online gave rise to a faster comprehensive elucidation of global characteristics of a complex matrix than more traditional phytochemical studies. Nutraceutical, aroma, flavor, and new compounds that have not been reported before were putatively identified and related to pericarp, placenta or seeds of *C. frutescens*. As presented here, some of these compounds have been reported with bioactivity properties, supporting empirical properties of pepper fruit that have been known for centuries. The procedure developed here will be utilized for further studies in our laboratory, including to enable the exploration of comparisons between wild cultivars of chili pepper fruit with their cultivated counterparts and for the further understanding of secondary metabolism in this crop. We recommend that complementary analysis should be carried out to confirm structural elucidation. In addition, compound isolation and bioactivity properties should be considered in future studies.

#### **4. Materials and Methods**

#### *4.1. Plant Material and Dissection of Tissues and Seed*

Seeds of Tabasco pepper (*C. frutescens* L.) were treated with 3% hypochlorite solution. Plants were grown in optimum conditions (30–32 ◦C), at greenhouse facilities between June and September of 2016. Fruits from different plants were collected at 60 DPA (red ripe stage), washed with deionized water and immediately frozen with liquid nitrogen and stored at −80 ◦C until dissection and analysis.

Five biological replicates (plants) were considered for the experiment and three fruits per plant were collected. Each fruit was first placed into dry ice to facilitate hand dissection into pericarp, placenta, and seed using a sterile scalpel. All fruit parts were ground using a ball mill (Retsch MM301) under cold conditions and applying liquid nitrogen.

#### *4.2. Chemicals, Reagents, and Standards*

All chemicals and reagents were purchased from AccesoLab S.A. de C.V. (Mexico, Mexico). Capsaicin and dihydrocapsaicin analytical standards, formic acid, methanol, acetonitrile were HPLC grade and purchased from Sigma–Aldrich (Mexico, Mexico).

#### *4.3. Sample Extraction and UHPLC-MS Analysis*

For metabolite extraction, the method employed was adapted from Matyash [64] as follows: Methanol, 1.5 mL, was added to 100 mg of sample in a test tube and vortexed for 1 min, then, 5 mL of diethyl ether was added. The mixture was incubated with gentle stirring for one hour at room temperature. Next, 1.5 mL of ultra-pure water (18 Ω, milli-Q system) was added and mixed vigorously for a further minute then kept at room temperature for 10 min to allow phase separation. After that, the sample was centrifuged at 1000 × g for 10 min. Aqueous and organic layers were recovered separately and vacuum dried (miVac®, Genevac) at 30 ◦C for 30 min and finally kept at −80 ◦C until further analysis.

Three quality control (QC) samples were prepared to account for instrument drift and system calibration during analysis in UHPLC-QTOF-HRMS; each QC sample was prepared by mixing homogenously all sample extracts into a new single vial, in both separated phases containing polar and non-polar compounds. QC samples were distributed at the beginning, middle, and end of the injection run list. Analytical standards of capsaicin and dihydrocapsaicin were injected under the same conditions as samples. Extraction blanks were also considered during the experiment.

For LC-MS analysis, all samples (including QC, analytical standards and blank extraction) were resuspended in 1 mL of acetonitrile/ultra-pure water 50:50 (*v*/*v*) and filtered through a membrane of 0.2 μm (PTFE, Agilent Technologies, Santa Clara, USA). Samples were injected according to a randomized list order on an UPLC®(Acquity class I, Waters, Milford, CA, USA) coupled with an orthogonal QTOF (SYNAPT G1 HDMS, Waters, Milford, CA, USA) mass spectrometer. Chromatographic separation was achieved on a reversed phase CSH C18 column (2.1 mm × 150 mm, 1.7 μm, Waters, Milford, USA) maintained at 30 ◦C during chromatographic separation. Auto-sampling of 10 μL per sample was injected. Compounds were eluted using ultra-pure water with 0.1% (*v*/*v*) formic acid (solvent A) and acetonitrile with 0.1% (*v*/*v*) formic acid (solvent B) with a flow rate of 0.3 mL/min with the following gradient program: From 0.5 to 30 min, 1–75% B; 30 to 31 min, 75% B; 31 to 31.5 min, 75–100% B; 31.5 to 34.5, 100% B; 34.5 to 34.6, 100-1% B; 34.6 to 36 min, 1% B. The mass spectrometer mass range was set from 50 to 1500 Da. Both ionization modes were injected separated. For negative electrospray ionization (ESI) mode, the conditions were set as follows: Capillary voltage 2 kV; cone voltage 40 V; source temperature 150 ◦C; cone gas flow 20 L/h; desolvation temperature 350 ◦C; desolvation gas flow 600 L/h. For the positive ESI mode: Capillary voltage 3 kV; cone voltage 40 V; source temperature 130 ◦C; desolvation temperature 350 ◦C; desolvation gas flow 700 L/h. Leucine-Enkephalin (2 ng/mL) was infused as LockSpray reference internal mass calibrant at a flow rate of 5 μL/min and its signal was monitored every 10 s. The data format was collected in a continuum mode with a MS scan time of 1.5 s. In both the positive and negative ionization mode, data were acquired in MS<sup>E</sup> experiments; using Argon as the collision gas with a collision energy in the trap region of 6 eV (Function 1, low energy) and ranged from 20–40 eV (Function 2, high voltage).

#### *4.4. Data Analysis*

Raw data was imported to Progenesis QI for small molecules software (Non-Linear Dynamics, Waters, Milford, MA, USA) for automatic alignment, normalization, deconvolution, and compound pre-identification over all samples separating the aqueous and organic phases. The RT range was limited from 0.5 to 35 min for pre-identification method. Pre-identification was performed using Chemspider Databases (PlantCyc, Plant Metabolic Network, KEGG, HMDB and ChEBI) and with an in-house database with a minimum match of 90% for precursor ions, MS/MS data and isotope distribution was included for increasing match score values. Statistics and graphics were performed using EZinfo 3.0 (Waters, Milford, MA, USA) and R (3.3.3v, Vienna, Austria) [65] software. Compounds were grouped according to their compound classes. The resulting data was first mean centered and scaled to Pareto and then submitted to a principal component analysis (PCA) using the first three components. Results were analyzed using one-way ANOVA and q-values were established using the false discovery rate (FDR < 0.01) to correct multiple comparisons by the Benjamini–Hochberg procedure [66].

**Supplementary Materials:** The following are available online at http://www.mdpi.com/2218-1989/9/10/206/s1, Table S1: Putative identification.xlsx.

**Author Contributions:** Conceptualization, J.J.O.-O.; data curation, F.C.-H. and P.A-G.; formal analysis, F.C.-H. and P.A.-G.; funding acquisition, J.J.O.-O.; methodology, F.C.-H. and P.A.-G.; supervision, O.M. and J.J.O.-O.; writing—original draft, F.C.-H.; writing—review and editing, O.M. and J.J.O.-O.

**Funding:** This work was supported by Consejo Nacional de Ciencia y Tecnología (CONACYT Mexico) for grant number 264354 and scholarship: 587878.

**Acknowledgments:** We thank Fernando Hernández-Godinez (Computational Biology Lab), María Esperanza Anaya-Gil (Metabolomics Lab), and Neftalí Ochoa-Alejo for their assistance, support, and suggestions during the experiment.

**Conflicts of Interest:** The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the result.

#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

*Article*
