*Article* **Comparison of Bi- and Tri-Linear PLS Models for Variable Selection in Metabolomic Time-Series Experiments**

#### **Qian Gao 1, Lars O. Dragsted <sup>1</sup> and Timothy Ebbels 2,\***


Received: 3 March 2019; Accepted: 8 May 2019; Published: 9 May 2019

**Abstract:** Metabolomic studies with a time-series design are widely used for discovery and validation of biomarkers. In such studies, changes of metabolic profiles over time under different conditions (e.g., control and intervention) are compared, and metabolites responding differently between the conditions are identified as putative biomarkers. To incorporate time-series information into the variable (biomarker) selection in partial least squares regression (PLS) models, we created PLS models with different combinations of bilinear/trilinear **X** and group/time response dummy **Y**. In total, five PLS models were evaluated on two real datasets, and also on simulated datasets with varying characteristics (number of subjects, number of variables, inter-individual variability, intra-individual variability and number of time points). Variables showing specific temporal patterns observed visually and determined statistically were labelled as discriminating variables. Bootstrapped-VIP scores were calculated for variable selection and the variable selection performance of five PLS models were assessed based on their capacity to correctly select the discriminating variables. The results showed that the bilinear PLS model with group × time response as dummy **Y** provided the highest recall (true positive rate) of 83–95% with high precision, independent of most characteristics of the datasets. Trilinear PLS models tend to select a small number of variables with high precision but relatively high false negative rate (lower power). They are also less affected by the noise compared to bilinear PLS models. In datasets with high inter-individual variability, bilinear PLS models tend to provide higher recall while trilinear models tend to provide higher precision. Overall, we recommend bilinear PLS with group x time response **Y** for variable selection applications in metabolomics intervention time series studies.

**Keywords:** time series; PLS; NPLS; variable selection; bootstrapped-VIP

#### **1. Introduction**

Metabolomics is a widely applied technology for capturing the perturbations of metabolites in biological systems and for discovery of dietary and health biomarkers. Liquid chromatography– mass spectrometry (LC-MS), nuclear magnetic resonance spectroscopy (NMR), and gas chromatography–mass spectrometry (GC-MS) are most commonly employed in metabolomics studies providing information-rich, high throughput data [1]. Such data contains information on hundreds or even thousands of metabolites, resulting in challenges for both data pre-processing and statistical analysis [2].

Biomarker discovery in metabolomic studies consists of several stages: collection of biological samples under different conditions; application of analytical techniques for characterising the "unknown" metabolome; extraction of information from raw analytical data; statistical analysis to select putative biomarkers with the capacity to discriminate the samples from different conditions; and further studies to validate the performance of selected biomarkers [3]. Selection of variables (putative biomarkers) plays an important role in the process as it determines the scale and outcome of later validation studies [4]. It is crucial to keep the number of selected variables at a reasonable level without compromising the number of true positives.

Time-series design has been adopted in many metabolomic studies for both biomarker discovery and validation stages. It is advantageous because it allows discovery of biomarkers responding to an intervention and provides time response information of biomarkers, which is of importance to select the best time window for sampling [5]. Figure 1 shows eight different types of temporal profiles typically seen in response to intervention in acute metabolomic studies (<24 h). Metabolites responding differently between the groups in such studies may vary in their temporal response profiles as seen in (**a**)–(**f**). Other metabolites (**g**)–(**h**) show no difference in response between control and intervention, or vary randomly over time, which is often the case for the majority of metabolites.

**Figure 1.** Typical temporal profiles of metabolites observed in metabolomics data from our onion study with a time-series design. (**a**)–(**h**) are temporal profiles of eight metabolites in control (grey) and intervention (purple) group. More details are explained in Text S1.

A time-series design yields more information but also leads to more complex data. Not only are the variables correlated but also temporal autocorrelation exists between time points. The classical supervised multivariate approach adopted in many metabolomic studies is PLSDA followed by variable (biomarker) selection [6]. PLSDA is a classification method based on classical PLS regression where the response variable, **y**, is categorical and represents which treatment group each sample belongs to [7]. Normally the model is built on the data acquired from a single time point, from combined time points or on pooled samples. However, in this case, only treatment group information is used while all the time response information is ignored.

Some attempts have been made to take time-series information into account during PLS modelling. One approach is to use time of sampling or maturity of the process as the response variable, **y**, [8] which has been applied in a small number of metabolomic studies [9,10]. The problem with this method is that it works well only when there is a linear relation between variables and time, which is often not the case (see Figure 1). Another approach is piecewise Orthogonal Projections to Latent Structures (OPLS), which uses a set of sub-models to describe the changes between successive time points [11]. This does not assume any linear trend between data and time, which makes it suitable for the analysis of non-linear response over time. However, the time-series information is distributed in a range of sub-models which hinders interpretation. A variety of non-PLS methods could also be adopted for modelling metabolomics time-series data but there are some limitations. Autoregressive

moving average with exogenous inputs models (ARMAX) or space-state models can be used to describe the temporal profiles, but typically requires more time points (>10) [12]. Smoothness or its combination with dimension reduction method have also been developed and applied for time-series data [13]. However, all the methods above mainly focus on predicting response to a treatment over time instead of selecting important variables which discriminate between different treatments. More investigation of time-series models using PLS for metabolomics analyses is therefore needed, especially with respect to variable selection, in order to provide better guidance on optimal data analysis of such datasets. Conventionally, metabolomic time series data are constructed into a two-way structure (Sample × Metabolite) in PLS modelling where the time response information is overlooked. To incorporate such information into the data structure, time can be considered as the third mode. In this study, five different bi- and tri-linear PLS models were used to identify important variables contributing to the difference between groups in intervention response metabolomics studies with a time-series design. The variable selection performance of the five models were evaluated on both simulated and real datasets to provide insight into the most appropriate modelling approach for intervention response time series experiments.

#### **2. Materials and Methods**

#### *2.1. PLS and NPLS*

PLS is a latent variable based multivariate linear regression between predictor variables (**X**) and dependent variables (**Y**), which aims at maximizing the covariance between the **X** and response **Y** [14]. N-PLS is an extension of PLS to multiway data [15] where **X** is an array with more than two dimensions (also referred to as ways or modes). Compared to PLS, N-PLS provides simpler models with relatively few parameters and avoids the interference between information from different modes [16]. Metabolomic data with a time-series design can be structured as a two-way table of dimensions *I* × *J*, where *I* = *S* × *T* (*S* = number of subjects; *T* = number of time points), *J* = number of metabolites. During the analysis, such data is divided into subsets according to time points for further analysis separately or analysed as a whole. In either case, the time-series information is not used by the model and correlation between samples collected at different time points from the same subject is lost. To make use of this autocorrelation between samples, we transformed the metabolomics time-series data into a three-way array with a size of *S* × *J* × *T* and analysed it using N-PLS. In the current paper, both two-way PLS and N-PLS models are used to analyse metabolomic time-series data and they are referred to as bi-PLS (bilinear-PLS) and tri-PLS (trilinear-PLS) models, respectively.

#### *2.2. PLS-DA and Dummy Y*

In standard PLSDA, class labels indicating the group membership of each sample are used as dependant **Y** (dummy **Y**), e.g., y for the intervention group sample is 1 and for the control group sample is 0 or −1. However, samples obtained from different time points can be very different within the same class causing large variation within classes and consequently lead to poor predictions. Dividing the data and building separate models for each time point can reduce this problem but in this case, there are fewer observations for each comparison, and most importantly, the time response information is not modelled. To include such variation into the dependent **Y** and provide more guidance on the separation of samples, we created a new 'time response' dummy **Y** to reflect how the metabolites respond to the intervention with time. Specifically, for the target metabolites (Figure 1a–f), their excretion experiences an increase and a decrease within a certain time frame i.e., their intensities are higher or lower in the middle of the time-series than that at the beginning or at the end of the time-series. Therefore, we assign the samples acquired from the middle of the time-series capturing the high intensities of the target metabolites to a 'response class' and samples acquired from the first and last time point to a 'no-response class'. Samples from these two classes are labelled with 10 and 1 respectively, which would be subsequently used as dummy **Y** in further PLSDA modelling. In our experiments, the model

performance improved with increasing magnitude of the 'time response class' until it reached around 10. Therefore, 10 was used as the label for 'response class' in this paper. The value of the time response should be a user-defined value and it can be adjusted by testing with a range of values to find out the optimal one achieving the best predictive performance (Q2 or area under the ROC curve) of the model.

#### *2.3. Comparison of Variable Selection by Five PLS Models*

In order to take advantage of the time-series data structure and to make use of both group and time response information, we combined different PLS models (bi-PLS or tri-PLS) with different dummy **Y**s (group or time response labels) as shown in Figure 2. Models 1–3 are bi-PLS models built on a two-way matrix **X** of size *ST* × *J*. Model 4–5 are tri-PLS models built on a three-way array **X** of size *S* × *J* × *T*. For models 1–4, group labels, time response labels or their products are used as a one-way dummy **Y**. Model 5 uses a two-way dummy **Y** with group label as the first mode and time response label as the second mode. We note that model 1 only addresses group differences, while model 2 only addresses time response changes. Since we are interested in both group and time responses, we included these two basic models against which to compare the more complex models 3–5. The five PLS models were applied on the same datasets and their performances were compared.

**Figure 2.** Structure of five PLS models for comparison. <sup>a</sup> The dummy **Y** in this figure is an example for data obtained from eight samples collected from two subjects at four time points (0, 2, 4, 24 h after intervention). Dummy **Y** in purple and grey colour corresponds to samples collected from Subject 1 (from intervention group) and Subject 2 (from the control group), respectively.

The focus of this paper is on the ability of the models to highlight variables important to the time-treatment response. In PLS regression, variable selection is used to improve model performance to provide better predictions [17]. It identifies variables with large influence on the model, which could be used to interpret the model and to be investigated as potential biomarkers in further studies. In the current paper, VIP scores were calculated to identify the relevant variables and a bootstrap procedure was adopted to estimate VIP uncertainty.

#### *2.4. Datasets*

#### 2.4.1. Simulated Datasets

In order to assess the variable selection performance of the five PLS models, a data simulation procedure is proposed to simulate the time-series metabolomic dataset. For a simulated dataset, we generated J variables and for each variable j, the observations for a subject *s* in group *g* are generated according to the following equation:

$$
\mathfrak{x}\_{\mathfrak{F}} = \mu\_{\mathfrak{F}} \circ (b\_{\mathfrak{s}} + w\_{\mathfrak{s}} + \mathfrak{e}\_{\mathfrak{s}})
$$

where ◦ denotes the entry-wise product and <sup>μ</sup>*<sup>g</sup>* = *<sup>c</sup>* + *at*α*e*−β*<sup>t</sup>* . μ*<sup>g</sup>* is the vector containing the values of the mean curve for the group g, of dimension 1 × *T* and *t* is the time. *c*, *a*, α, β are generated from uniform distributions, the intervals of which are adjusted to create different temporal profiles, as shown in Figure 1a–h for both intervention and control groups (see Figure S1). The 1 × T vector *bs* controls inter-individual variability, which follows a normal distribution with zero mean and covariance matrix σ2 *b* **1***T*, where **1** denotes a matrix with all entries equal to 1, with σ<sup>2</sup> *<sup>b</sup>* being the inter-individual variance. The intra-individual variability denoted by the 1xT vector *ws* is taken to be multivariate normally distributed with zero mean and covariance *Dw*. *Dw* is a first-order autoregression covariance matrix of dimension *<sup>T</sup>* <sup>×</sup> *<sup>T</sup>* with entries being *Dw*(*i*, *<sup>j</sup>*) = <sup>σ</sup><sup>2</sup> *w*ρ|*i*−*j*<sup>|</sup> , where σ<sup>2</sup> *<sup>w</sup>* is the intra-individual variance and ρ is the autocorrelation coefficient between two consecutive time points. The noise ε*<sup>s</sup>* is normally distributed with zero mean and covariance matrix σ<sup>2</sup> <sup>ε</sup>*1T*, of dimension *T* × *T*.

Sixteen datasets with different numbers of subjects, numbers of variables, inter-individual variability, intra-individual variability and number of time points were generated with the above simulation method. In each of the sixteen datasets, eighty discriminating variables were simulated. Table S1 provides an overview of the characteristics of all the datasets.

In the simulated dataset, the variables with the temporal profiles, (**a**)–(**f**) in Figure 1, were considered as discriminating variables, which are the target of variable selection while avoiding selection of variables with profiles (**g**)–(**h**) in Figure 1.

#### 2.4.2. Onion Intervention Data

This data is taken from a randomized controlled trial with a crossover design, where participants were assigned to either an onion consuming group or a control group. Untargeted UPLC-qTOF-MS was applied to measure the metabolites in urine samples at four time points (0, 2, 4, 24 h after intervention) for six subjects per group [18]. The resulting raw data consists of 48 samples.

#### 2.4.3. Coffee Intervention Data

This data is generated from a randomized controlled trial with a crossover design, where urine samples were collected at 0, 0.5, 1, and 2 h after intervention with coffee or control drink (water) from 11 subjects per group. A total of 88 samples were analysed with untargeted UPLC-qTOF-MS [18].

Both onion and coffee raw data were converted to NetCDF files using DataBridge (Waters, Manchester, UK) and analysed with MZmine 2.19 for data peak detection, alignment and quantification [19]. The preprocessed data were imported into MATLAB and feature reduction was applied to remove unreliable variables due to compounds with extreme retention times, variables not detected in more than 70% of the samples in each subgroup and variables with a coefficient of variation (CV) in pooled quality control samples higher than 0.7 [20]. The resulting onion and coffee data sets had dimensions (samples x variables) of 48 × 3209 and 88 × 2321, respectively.

For onion and coffee intervention data, true discriminating variables are not known *a priori*. However, to enable evaluation of the variable selection performance on this real data, 'truly' discriminating variables were determined by two methods. First, visual inspection was applied to identify variables exhibiting profiles similar to (**a**)–(**f**) in Figure 1. Second, a t-test was applied at each

timepoint and the variable flagged if at least one time point was significant with a nominal *p* < 0.05. Variables were considered discriminating if both methods indicated a difference, and were the object of variable selection procedures.

#### *2.5. Workflow*

The assessment of the variable selection of the five PLS models was performed on the simulated datasets as well as real datasets. The workflow is outlined in Figure 3 and explained in the following sections.

**Figure 3.** Workflow for the evaluation of variable selection performance of five PLS models on simulated datasets (top) and real datasets (bottom). For the simulated datasets, a single cross validation was applied on one of them to determine the optimal number of latent variables, which was then applied to all the similar simulated datasets for building the PLS models and variable selection. For real datasets, the optimal number of latent variables was obtained based on a single cross validation on the whole dataset and the PLS models were built on the whole dataset for variable selection.

#### 2.5.1. Pre-Processing of Data

Centring and scaling are commonly applied prior to the regression modelling and have a critical influence on the performance of the model. Centring is performed to shift the mean of the data to zero and scaling is used to adjust the relative influence of variables with different variability. Centring across the first mode (samples or subjects) is a widely accepted step for both two-way and three-way data while scaling is more complicated. Scaling within one mode may disturb other modes [21,22]. In the current study, centring across the first mode was applied for both two-way and three-way data. For the two-way data, the values for each variable (column) were scaled to unit variance. On the three-way data, single-slab scaling within the metabolite mode was applied, as recommended by Gurden et al. [23]. (A slab is a single layer of the three-way array, here corresponding to a single variable). In single-slab scaling, each variable in the jth slab is scaled to unit root-mean-square of the slab (RMSj):

$$\text{RMS}\_{\text{\%}} = \sqrt{\frac{\sum\_{\text{s=1}}^{\text{S}} \sum\_{\text{t=1}}^{\text{T}} \mathbf{x}\_{\text{\%}}^{2}}{\text{ST}}}$$

$$\mathbf{x}\_{\text{\%}}^{\*} = \frac{\mathbf{x}\_{\text{\%}}}{\text{RMS}\_{\text{\%}}}$$

where xsjt is the intensity of metabolite j in the sample acquired from subject s at time point t, x<sup>∗</sup> sjt is the single-slab scaled data.

#### 2.5.2. Model Optimization and Evaluation

For both simulated and real data, a single cross validation scheme was implemented, and the optimal number of latent variables was decided as the smallest number at which the decrease in root mean squared error in cross validation (RMSECV) between consecutive models was less than 2%. Due to the similarity of the repeated simulations using the same parameters, for the same type of PLS model, the number of latent variables was determined on one dataset and adopted for all the other repeats.

A two-stage procedure was used to evaluate the performance of different models on simulated datasets. Each simulated dataset was divided into training and test sets. First, variable selection performance was evaluated on the training set. Next, the model's predictive ability was evaluated on the test set.

#### (1). Evaluation of Variable Selection Performance with Training Sets

Balanced bootstrapping was performed to resample B bootstrap datasets [24,25]. Various values of B were tested and B = 200 was chosen as the smallest value providing consistent results (data not shown). PLS models with an optimal number of latent variables were built on each bootstrap subset and the Variable Importance in Projection (VIP) was calculated for each variable [26,27]. For each variable, the mean (VIP\*) and standard deviation (σ*VIP*) of the B VIP values were obtained. The variable was selected if the lower-bound of the one standard deviation error bar was above 1 (i.e., *VIP*<sup>∗</sup> − σ*VIP* > 1).

To evaluate the variable selection performance of the five models, "Variable Selection ROC curves" were created. Since the discriminating variables are known, the model selecting the higher number of discriminating variables and lower number of non-discriminating variables is considered to have better variable selection performance. After the selected variables were obtained for each model, the number of variables considered as true positives, false positives, true negatives and false negatives were calculated according to Table 1. The comparison between convention ROC curve and Variable Selection ROC curve are shown in Figure S2.


**Table 1.** Variable selection confusion matrix.

The area under the variable selection ROC curve (AUVSC) was calculated to provide an evaluation of the overall variable selection performance of each model. The following scores were calculated:

Recall = TP/(TP + FN) Precision = TP/(TP + FP) F1-score <sup>=</sup> (β2+1)<sup>×</sup> Precision <sup>×</sup> Recall/β2<sup>×</sup> (Precision <sup>+</sup> Recall)

Recall reflects the model's capacity to select all the discriminating variables. Precision expresses the ability of the model to avoid the selection of non-discriminating variables. The F1-score is an overall assessment of the model's performance on recall and precision, assessing the effectiveness of the model to identify all the discriminating variables without selecting too many non-discriminating variables. β is set to 1 to emphasize the importance of both recall and precision for a reasonable selection of variables.

#### (2). Evaluation of predictive ability with test sets

The models with the optimal number of latent variables determined on the training sets were applied (using all variables) to the corresponding test sets. Predictive variance explained Q2, and area under the conventional ROC curve (AUC, using all variables) were calculated to evaluate the predictive ability of the model.

For real datasets, stage (1) evaluation of variable selection performance was performed on the whole dataset. Stage (2) evaluation was not applied because the numbers of subjects are too small in the real datasets to obtain an independent test set. Instead, a permutation test was performed to evaluate the validity of the model.

#### *2.6. Evaluation of the Influence of Characteristics of the Dataset on the Model Performance*

Characteristics of metabolome vary, e.g., between different human studies, from humans to animals, and from studies on diets or drugs thereby leading to different characteristics of the datasets potentially influencing the performance of the statistical methods. To evaluate such influences and to provide guidelines for the use of different models, all five PLS models were applied on all sixteen simulated datasets as shown in Table S1. The results from different datasets were compared to assess the influence of characteristics on the model performance. The datasets used to compare the evaluation of different characteristics are shown in Table S2.

All the calculations were performed in MATLAB Version R2015b (8.6.0.267246) (The Mathworks, Inc, Natick, MA, USA) using scripts modified from N-way toolbox [28] and multiway VIP package [27]. The code for building the five PLS models and performing variable selection is available at https://github.com/qian-gao/PLSvar\_sel. Simulated dataset 3 and anonymised onion intervention data are provided as examples for testing.

#### **3. Results**

#### *3.1. Assessment of Variable Selection Performance on Simulated Data*

#### 3.1.1. Overall Evaluation

The overall evaluation of the variable selection, prediction and classification performance of the five PLS models was performed on Dataset 3 (10 subjects, 3000 variables, 4 time points) and the results are shown in Table 2 and Figure 4. Dataset 3 was chosen for the overall evaluation because it has the characteristics that are most similar to those of the real dataset. As expected, bi-PLS models resulted in a higher number of latent variables than tri-PLS models indicating higher model complexity. Model 3 showed the best variable selection performance in that it provides the highest number of true positives with a relatively small number of selected variables, consequently leading to the highest precision. Model 1, 4 and 5 selected similar numbers of true positives while model 1 selected a higher number of false positives showing low precision. Model 2 showed the best prediction with highest Q2 but, as expected, provided a poor classification of the samples according to group. Unsurprisingly, only a few true positive variables were selected together with a large number of false positives resulting in the poorest precision. The number of latent variables did not have a strong influence on performance. Restricting all models to two latent variables (see Table S3), showed that the performance was not markedly different from that of those presented in Table 2.

**Table 2.** Performance of five PLS models evaluated on simulated datasets with the optimal number of latent variables.


<sup>a</sup> # LV, number of latent variables; <sup>b</sup> # Varsel, number of selected variables; <sup>c</sup> # TP, number of true positives. Values reported are mean and standard deviation across 100 repeats.

**Figure 4.** Evaluation of the variable selection performance of five PLS models on 100 simulated datasets. The variable selection performance consists of four criteria—area under the ROC curve (AUVSC), recall, precision, F1- score, which were calculated based on the variable selection confusion matrix.

Although the variable selection performances of the five models vary, the majority of discriminating variables were selected by at least two models, and variables selected by model 3 included approximate all the variables selected by other models (Figure 5). Beyond that, model 3 selected about eight unique true positives which were selected by none of the other models. The discriminating variables also had higher ranks in model 3 than the other models indicating its efficiency in variable selection (Figure 6). Model 4 and 5 resulted in low overall level of VIP scores and relatively larger variation, which caused a higher number of false negatives.

**Figure 5.** Comparison among discriminating variables selected by five PLS models in simulated Dataset 3. The coloured and white strips represent true positives (selected discriminating variables) and false negatives (unselected discriminating variables), respectively. The discriminating variables were arranged in order so that the variables selected by all five models were on the left side and the variables selected only by one model were on the right side.

**Figure 6.** Rank of VIP scores for the discriminating variables in five PLS models on simulated Dataset 3. Bootstrapped VIP scores for all the variables were ranked according to their mean VIP scores in descending order. Bars show the mean +/- one standard deviation. Red and black represent the variables which are discriminating or non-discriminating, respectively. The horizontal blue dash line corresponds to VIP = 1.

#### 3.1.2. Influence of Characteristics of the Dataset on the Performance of the Five PLS Models

The influence of number of subjects, number of variables, inter-individual variability, intra-individual variability and number of time points was assessed with the simulated datasets and the results are shown in Figure 7, Table 3, Table 4, Figure S3 and Figure S4.

As expected, variable selection performance (recall and precision) of all models was improved with increasing number of subjects (Figure 7). Notably, model 3 was only slightly affected by the number of subjects as it maintained its good recall and precision throughout all datasets, suggesting good robustness to this parameter.

**Figure 7.** Influence of number of subjects (6–12) on the variable selection performance of five PLS models on simulated datasets. Recall and precision were calculated based on the variable selection confusion matrix.

**Table 3.** Influence of the number of variables on variable selection performance of the five models on simulated data.


<sup>a</sup> # Varsel, number of selected variables; <sup>b</sup> # TP, number of true positives. Values reported are mean and standard deviation across 100 repeats.

**Table 4.** Influence of inter-individual variability on variable selection performance of five models on simulated data.


<sup>a</sup> # Varsel, number of selected variables; <sup>b</sup> # TP: number of true positives. Values reported are mean and standard deviation across 100 repeats.

Not surprisingly, the increased number of noisy variables in the data led to a higher number of selected variables and most of the extra selected variables are false positives (Table 3). The number of true positives in model 1–3 was not affected by the noisy variables while Model 4 and 5 selected fewer true positives but also fewer false positives under the influence of noise.

Variable selection performances of the five models were strongly affected by inter-individual variability in that all the models except model 2 selected fewer true positives with larger inter-individual variability (Table 4). When the inter-individual variability increased, bi-PLS models tended to maintain their recall by sacrificing the precision; tri-PLS models tended to maintain the precision by keeping a stable number of selected variables. Overall, model 3 was less affected by inter-individual variability than other models showing a good trade-off between recall and precision.

Intra-individual variability had little influence on the variable selection performance of five PLS models (Figure S3). As expected, a higher number of time points led to better recall of all the five models and model 1 benefited most from the extra temporal information (Figure S4).

#### *3.2. Assessment of Variable Selection Performance on Real Data*

The five PLS models were applied on the onion intervention data to discover variables discriminating the control and intervention groups and the results are shown in Figures 8 and 9. Similar to the simulated dataset, Model 3 provided the best recall and precision resulting in around 28 more true positives than the second best, model 1. Again, most of the discriminating variables had high ranks in model 3. Model 4 and model 5 were not capable of selecting many true positives in this more challenging real dataset, perhaps due to their tendency to maintain precision by keeping a low number of selected variables when dealing with data having large inter-individual variability. The low overall level of VIP scores and relatively large variation could also be the reason why so few variables were selected in Model 5. Interestingly, when variables were selected according to loading weights (instead of VIP), the performance of model 4 and 5 was improved, and was similar to the performance of model 1, but still not better than Model 3 (see Figure S6). A permutation test was performed which showed that Model 3 was significant at *p* < 0.001 (Figure S9).

**Figure 8.** Evaluation of the variable selection performance of five PLS models on onion study data. Recall, precision and F1-score were calculated based on the variable selection confusion matrix. # LV, number of latent variables; # Varsel, number of selected variables; # TP, number of true positives.

**Figure 9.** Rank of VIP scores for the discriminating variables in five PLS models on onion study dataset. Bootstrapped VIP scores for all the variables were ranked according to their mean VIP scores in descending order. Red and black represent the variables which are discriminating or non-discriminating, respectively. The horizontal blue dash line corresponds to VIP = 1.

#### 3.2.1. Coffee Intervention Study

In coffee intervention study, urine samples were collected at 0, 0.5, 1, and 2 h after intervention. Due to the short sample collection period, the temporal profiles of metabolites were incomplete as shown in Figure S7. In this case, the time response class was labelled as 1 for the samples collected at 0 h and 10 for the samples collected at 0.5, 1, and 2 h after intervention. The performances of the five PLS models on coffee intervention data were similar to that for simulated data and the results were shown in Figures 10 and 11. Model 3 gave the highest number of true positives with a reasonable number of selected variables. It also provided the most comprehensive list of selected variables; its selection of true positives included almost all the true positives found in all the other models (see Figure S8). The permutation test (Figure S9) indicated Model 3 was significant at *p* < 0.001 confirming that it was not overfitted and therefore its good variable selection performance was valid. Tri-PLS models were very conservative in that they selected fewer variables but gave very high precision. In fact, the discriminating variables had better ranks in tri-PLS models than in Model 1, so that if we lower the threshold for bootstrapped-VIP scores, model 4 and 5 would outperform model 1.

**Figure 10.** Evaluation of the variable selection performance of five PLS models on coffee study data. Recall, precision and F1-score were calculated based on the variable selection confusion matrix. # LV, number of latent variables; # Varsel, number of selected variables; # TP, number of true positives.

**Figure 11.** Rank of VIP scores for the discriminating variables in five PLS models on coffee study dataset. Bootstrapped VIP scores for all the variables were ranked according to their mean VIP scores in descending order. Red and black represent the variables which are discriminating or non-discriminating, respectively. The horizontal blue dash line corresponds to VIP = 1.

#### **4. Discussion**

In this paper, five PLS modelling approaches for metabolomic time series data were evaluated with simulated and real data with the objective of identifying variables showing discriminatory temporal patterns. The variable selection performance of the models was compared on simulated datasets based on their capacity to select discriminating variables while avoiding non-discriminating variables. The influence on model performance of five factors (number of subjects, number of variables, inter-individual variability, intra-individual variability and number of time points) was assessed to provide additional information on the application of suitable models for different scenarios of data.

Several issues have been considered regarding the development of these models. Bootstrapped-VIP scores were calculated to evaluate the importance of variables in the current paper. This approach was shown in previous studies [4,29,30] to be sensitive and precise in selecting relevant variables. However, we are aware that it might not always be the optimal approach for all models or datasets. For example, in the analysis of the onion intervention study, the loading weight for the first component was a more powerful selection tool for N-PLS models. This might be due to the fact that loading weight for the first component directly reflects the covariance between **X** and **Y**. Additional components that are found in the residuals after removal of previous components from **X** might be influenced by irrelevant information in **X** [31]. Therefore, the inclusion of the information from extra components does not necessarily result in better variable selection performance of the model. This may also explain why no strong difference in variable selection performance was observed between the models with different number of latent variables.

Another issue is the applicability of the proposed model to data with incomplete temporal profiles, where metabolite levels may not return to pre-intervention levels. For example, the coffee intervention study collected data at 0 h and 0.5, 1, and 2 h after intervention which means the whole excretion profile of the metabolites might not be recorded since the sampling period was too short (Figure S7). Our results from the coffee study indicate that incomplete temporal profiles can still provide information for the identification of discriminating variables as long as the 'response class' and 'non-response class' (i.e., responding and nonresponding time points) are accurately assigned.

The resulting models were assessed on both simulated and real data and our results were consistent in showing that 1) The bi-PLS model with combined time response and group information as **Y** (model 3) had the best variable selection performance and the most comprehensive list of true positives for all datasets tested. 2) The tri-PLS models tested both tend to maintain high precision by sacrificing recall, however they show robust performance on data with a high number of noise variables. 3) In datasets with high inter-individual variability, bi-PLS models tend to provide higher recall while tri-PLS models tend to provide higher precision. As expected, the bi-PLS model with time response as **Y** (model 2) performed most poorly under all conditions confirming that time response alone is not enough to discriminate samples from different classes.

Discovery, identification and validation of biomarkers in metabolomic studies is difficult and time-consuming. The goal is often to provide a list of discriminating variables with as many true positives and as few false positives as possible. Based on this goal and our comparison between the five models, Model 3 provided both good recall and precision and therefore represents a good choice for suitable datasets with time response profiles in two treatment groups. When the time dependant response is not recorded, model 1 and 4 may be adopted as the best general approach and they can be selected in different situation depending on the purpose of the study. For instance, Model 1 would be a good choice for exploring the data and collecting as many relevant variables as possible since it tends to keep high recall at any costs. For studies aiming at finding biomarkers with potential to classify new samples, model 4 has the potential to select the most suitable metabolites because of its good precision.

Time-series designs are widely used in life science research and the purpose is to observe the response of a biological system to a certain challenge over a defined time period. Although this work is demonstrated with LC-MS metabolomic data, it is applicable also to other types of multivariate time-series data, such as RNA-seq experiments aiming to detect the gene expression differences between experimental groups. Several methods have been proposed previously to deal with this type of time-series data. Bar-Joseph et al. [32] describes gene expression over time as a continuous curve and identifies genes showing significant temporal expression differences based on the difference between the curves. To accurately fit the curve representing the temporal profile, this method usually requires relatively long time series and homogeneous data which is not often available due to limitations of the study design or high inter-individual variability. Compared to this method, in our study Model 3 successfully dealt with short time series data and maintained high recall and precision even in the presence of high inter-individual variability. Regression-based methods have also been developed where gene expression is described as a function of time and regression coefficients of each gene from different experimental groups are compared using ANOVA [33]. Compared to this method, our Model 3 and 5 retain the multivariate structure and thus take correlations between variables into account. ANOVA-simultaneous component analysis (ASCA) is another popular method that can be applied to time-series data [34]. The data is separated into the variations that contributed by different experimental design factors such as time, dose of intervention, and their interactions using ANOVA equation. Simultaneous component analysis is then applied to different variations to approximate the scores and loadings in each sub-model. ASCA is efficient in separating design factors and exploring the data correspondingly. However, it is not able to select variables with specific response profiles (e.g., (a)–(f)) as our models do but only indicate if there is an overall difference. Moreover, these five models have low computational cost.

In summary, both simulated and real data demonstrate that bilinear PLS model with group × time response as dummy **Y** is a powerful method for variable selection in time-series experiments. It maintains good performance in the presence of noise and high inter-individual variability. In general, bi-PLS models tend to provide higher recall while tri-PLS models tend to provide higher precision.

**Supplementary Materials:** The following are available online at http://www.mdpi.com/2218-1989/9/5/92/s1. Figure S1: Temporal profiles of metabolites in simulated data with a time-series design. Figure S2: Convention ROC curve and Variable Selection ROC curve (VSROC) for a. model 1 and b. model 3 in simulated Dataset 3. Figure S3: Influence of intra-individual variability (0.1–0.4) on the variable selection performance of five PLS models on simulated datasets. Figure S4: Influence of number of time points (3–6) on the variable selection performance of five PLS models on simulated datasets. Figure S5: Comparison among discriminating variables selected by five PLS models in onion study data. Figure S6: Comparison between the variable selection performances based on loading weight (green) and VIP (blue) of five PLS models on onion study data. Figure S7: Temporal profiles of metabolites observed in our coffee data with a time-series design. Figure S8: Comparison among discriminating variables selected by five PLS models in coffee study data. Figure S9: Area under the ROC curve (AUC) calculated on permuted **Y** data for model 3 (histogram, 1000 permutations) and original data (red dot) generated on samples obtained from onion (left) and coffee (right) studies. Table S1: Characteristics of sixteen simulated datasets. Table S2: Datasets used to compare for the evaluation of different factors. Table S3: Performance of five PLS models evaluated on simulated dataset with same number of latent variables. Table S4: Parameters used for pre-processing the onion study data in MZmine2. Table S5: Parameters used for pre-processing the coffee study data in MZmine2.

**Author Contributions:** Q.G., L.O.D. and T.E. designed the study. Q.G. and T.E. developed the strategy of data simulation and statistical analyses and Q.G. conducted the study. Q.G. and T.E. were responsible for the interpretation of the results. Q.G drafted the manuscript. All the authors reviewed the final manuscript.

**Funding:** The work was funded by a grant from the China Scholarship Council (201506350127) to Q.G. T.E. acknowledges support from National Institutes of Health (grants R01HL135486 and R01HL133932).

**Acknowledgments:** We express our gratitude to Camilla T. Damsgaard, Eduvigis Roldán-Marín, Concepción Sánchez-Moreno, M. Pilar Cano, Birgitte Borg and Lars O. Dragsted for performing the onion study and generously allowing us to use the data. We would like to thank Elin Rakvaag and Lars O. Dragsted for their work in the coffee study and kindly giving us the permission to use the data.

**Conflicts of Interest:** The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **rMSIKeyIon: An Ion Filtering R Package for Untargeted Analysis of Metabolomic LDI-MS Images**

**Esteban del Castillo 1,**†**, Lluc Sementé 1,**†**, Sònia Torres 1,2, Pere Ràfols 1,2,\*, Noelia Ramírez 1,2, Manuela Martins-Green 3, Manel Santafe <sup>4</sup> and Xavier Correig 1,2**


Received: 7 June 2019; Accepted: 30 July 2019; Published: 2 August 2019

**Abstract:** Many MALDI-MS imaging experiments make a case versus control studies of different tissue regions in order to highlight significant compounds affected by the variables of study. This is a challenge because the tissue samples to be compared come from different biological entities, and therefore they exhibit high variability. Moreover, the statistical tests available cannot properly compare ion concentrations in two regions of interest (ROIs) within or between images. The high correlation between the ion concentrations due to the existence of different morphological regions in the tissue means that the common statistical tests used in metabolomics experiments cannot be applied. Another difficulty with the reliability of statistical tests is the elevated number of undetected MS ions in a high percentage of pixels. In this study, we report a procedure for discovering the most important ions in the comparison of a pair of ROIs within or between tissue sections. These ROIs were identified by an unsupervised segmentation process, using the popular k-means algorithm. Our ion filtering algorithm aims to find the up or down-regulated ions between two ROIs by using a combination of three parameters: (a) the percentage of pixels in which a particular ion is not detected, (b) the Mann–Whitney U ion concentration test, and (c) the ion concentration fold-change. The undetected MS signals (null peaks) are discarded from the histogram before the calculation of (b) and (c) parameters. With this methodology, we found the important ions between the different segments of a mouse brain tissue sagittal section and determined some lipid compounds (mainly triacylglycerols and phosphatidylcholines) in the liver of mice exposed to thirdhand smoke.

**Keywords:** mass spectrometry imaging; metabolomics imaging; biostatistics; ion selection algorithms

#### **1. Introduction**

Mass Spectrometry Imaging (MSI) is a label-free analytical technique that can locate chemical compounds (metabolites, peptides, lipids, or proteins) directly in a biological sample and give their concentration for every pixel. The most common analytical strategy is MALDI due to its soft ionization, fast analysis, high throughput, versatility, and selectivity [1]. Other techniques, like desorption electrospray ionization (DESI), are becoming more popular because of the simplicity of their sample preparation [2]. MSI is currently used in the fields of drug discovery and toxicology [3,4]. In most experiments, researchers use a targeted strategy, which consists of visualizing and (sometimes) quantifying the concentration of a particular compound, or a reduced set of compounds throughout the tissue. Many MSI software packages have been released [5]. However, none of them provides an

automated workflow for untargeted MSI applications since the end-user has to approach each MSI experiment data analysis in its unique manner.

Besides annotating and identifying the MS ions, one of the main challenges in untargeted MSI analysis is to determine the statistically differentiating ions in different regions of interest (ROIs) of the same tissue section or in different tissues of case versus control experiments. These key ions could be associated with biomarker candidates of disease or treatment efficacy. Previous studies have successfully used segmentation processes to find these key ions between clusters [6,7]. Most of these studies identify the key ions associated with a certain region by analysing the ions that most influence the segmenting process. In [8], the authors applied a Non-negative Matrix Factorization multivariate analysis to select a reduced group of lipid MS signals associated with the metabolite profile of each component. The *t*-test associated with segmentation with Spatial Shrunken Centroids can find the enriched and absent MS peaks for a particular region in a segmented image [9,10]. A technique based on deep unsupervised neural networks and parametric *t*-SNE was used to detect metabolic hidden sub-regions [11]. The same algorithm, linked to a significance analysis of microarrays (SAM), detected the protein subpopulations that can differentiate between *t*-SNE segments in a dataset of breast cancer samples; interestingly, they used the selected ions for a kNN second segmentation step [12]. Gorzolka et al. [13] studied the space-time profiling of the barley germination process by carrying out an unsupervised joint segmentation of a high number of images and found the ion-associated profile for every segment. The Algorithm for MSI Analysis by Semi-supervised Segmentation (AMASS) was used to segment leech embryo samples [14] and there is a complete analysis of the ions associated to every region according to its weighting factors. In all these references, no statistical significance test was conducted on the key ions found.

Another common strategy in MSI data analysis is to manually define the ROIs to be compared, guided by an annotated histology image [15–18]. In general, the ions are selected by means of statistical hypothesis testing and the fold change (FC) calculation of the ion concentrations between ROIs. These parameters are usually represented as volcano plots. By way of example, Hong et al. [19] studied the global changes of phospholipids in brain samples from a mouse model of Alzheimer disease by performing ANOVA tests of ion concentrations in ROI. A common problem that MSI has in calculating statistical significance is that the *p*-values are generally extremely low [16]. This is because there are a large number of pixels within each ROI, which gives this parameter a low discrimination power.

Additionally, the statistical hypothesis testing (such as the *t*-test) fails when is applied to compare the concentration of an ion between ROIs. The existence of morphological areas in the images is the responsible of a high pixel autocorrelation. This violates the assumption of observation independence necessary for statistical hypothesis testing. In order to find statistically significant ions between ROIs, Conditional Autoregressive (CAR) models, which take into account the auto-correlated nature of ion distribution concentration in MS image ROIs, are calculated to correct the *p*-values [20]. Nevertheless, the difficulty of calculating the autocorrelation models and the complexity of the computational approach hampers the inclusion of this strategy in a MSI workflow.

Another common situation in MS imaging is the elevated intensity differences of the ions' concentration between pixels, due to the existence of several morphologic regions with different metabolic profiles [21] and the ion shielding phenomena which takes place in MSI. It is also common to find a high proportion of pixels where a certain ion is not detected, for a given signal to noise ratio. This influences to a large extend the calculation of the *p*-values and the FC.

In this study, we describe the development of an ion filtering algorithm that is used in a workflow for the untargeted analysis of metabolomic MALDI-MS images. The workflow consists of a segmentation step, followed by the ion filtering procedure, independent of the segmentation process, that detects the up/down regulated ions between image segments. Our algorithm calculates and combines three parameters: (a) the Mann–Whitney U statistical test of the ion concentration between segments [22]; (b) the FC in the ion concentration between segments; and (c) a new parameter that accounts for the proportion of pixels with undetected ions between segments. In addition, the data

from which parameters (a) and (b) are derived is obtained by previously filtering out the undetected MS signals (null values). With this methodology, we can find the key ions between any segment pair in MSI datasets, from single or multiple tissue sections. We successfully applied this workflow to the analysis of mouse brain tissue sample and to study fatty liver disease in mice liver tissue samples.

#### **2. Results**

The rMSIKeyIon package, written in R, is able to find the key ions in a pair of ROIs within or between images. The ions are selected according to the similarity parameters calculated in Appendix A and ordered following the contrast parameter, described in Appendix B. In Figure 1, there is a description of the data processing workflow, showing the main steps implemented in the rMSIKeyIon package. The spectra preprocessing and image segmentation has to be performed before and independently to the rMSIKeyIon execution. The resulting list of selected ions is related to the key metabolites exhibiting biological difference between tissue regions and reducing the candidates to identify.

In the next section, we will describe the results of the package in the analysis of a sagittal brain mouse sample, which has been segmented by k-means algorithm (Section 2.1). In particular, we will illustrate the up or down regulated ions resulting of the comparison of two clusters and the up/down regulated ions when comparing one cluster with the rest.

In the Section 2.2, we will apply the package in the identification of the fat areas in control liver samples and liver samples exposed to thirdhand smoke (THS).

#### *2.1. Results of the Brain Mouse Sample*

Figure 2 shows the number of up and down-regulated ions associated with the comparison of one particular cluster with each of the others (columns 1 to 6) in the segmented image of the brain slice tissue of C57BL/6 mouse using the k-means algorithm (n = 7 clusters). Cluster 7, identified as non-tissue section areas, has not participated in the comparisons. In column "All" appear the ions that are up-regulated (or down-regulated) in a cluster as a result of the comparison between this cluster and the rest of clusters, called "absolutely up-regulated ions" (or "absolutely down-regulated ions"). The *m*/*z* values resulting from comparisons can be available at the GitHub repository of the package (https://github.com/LlucSF/rMSIKeyIon).

**Figure 2.** Number of up or down-regulated ions associated with the comparison of one particular cluster with each of the others (columns 1 to 6) and the ions that are up-regulated (or down-regulated) in a cluster as a result of the comparison between this cluster and the rest of clusters, called "absolutely up-regulated ions" (or "absolutely down-regulated ions"). The image is composed by 6898 pixels and the number of detected ions is 277. The percentile value used for the selection of the ions is 1% for the null concentration parameter (Z) and 10% for the Mann–Whitney U (V) test and for the concentration fold change (FC). The intensity threshold for the ions is 2.5 <sup>×</sup> <sup>10</sup>−<sup>4</sup> over the normalized spectra matrix. The small lack of symmetry observed in the table is a consequence of the lack of symmetry in the distributions considered. In (**a**), the up-down regulated ions are calculated following the classical procedure, while in (**b**) the ions are calculated according the procedure described in section methods, that considers that the null values are not taken into account.

For each cluster comparison, an associated figure gives information about the resulting up or down-regulated ions, and the number of null and non-null parameters defined in the section Ion analysis and filtering (see below). The ions on the list are ordered in terms of the value of the "contrast parameter", calculated with Equation (A4) in Appendix B.

Figure 2a shows the results obtained by the classical procedure, where null values do not have a special treatment. Figure 2b corresponds to the case in which the null values are treated separately. Although both cases make use of the same processing parameters, the results are very different. Figure 2b shows a higher abundance of up-down regulated ions versus Figure 2a. In addition, the ions find in Figure 2b are of higher relevance, as can be seen in Figure S1. Figure S1 shows the two ions with the highest contrast value from the volcano plot when comparing clusters 2 and 6. Figure S1a corresponds to the classic test, and Figure S1b corresponds to the separation of the null values.

A slightly asymmetry is displayed in the tables present in Figure 2. Each parameter has its own set of discriminant values. They are obtained from the evaluation of each parameter on all the pairs of clusters without repetition. The distribution generated by the set of all these values may not be symmetric. By applying the same percentile on both tails of the distribution, non-symmetric discriminant values may arise.

#### 2.1.1. Comparison of C2 & C6

By way of example, the comparison of clusters C2 and C6 showed 63 up-regulated ions in C2 versus C6 and 16 down-regulated ions in C2 versus C6.

As an example, Figure S2 shows the volcano plot of the ions resulting from the comparison of C2 and C6. The ions at the top right and top left are selected by the ion filtering algorithm (see the caption to Figure S2 for more details).

Figure S3a shows the histogram of the concentration of the up-regulated ion with the highest contrast parameter (*m*/*z* 198.076) in C6, and Figure S3b shows the histogram of the up-regulated ion (*m*/*z* 848.636) in C2 also with the highest contrast parameter.

Figure 3a shows the segmented brain image (n = 7), and Figure 3b,c shows the concentration intensity plot of the ions mentioned above. In these intensity maps, the contrast intensity between both ions and clusters is clear, and the intensity of *m*/*z* 848.636 is much higher in C2 than in C6 and vice-versa for *m*/*z* 198.076.

**Figure 3.** (**a**) Mouse brain segmentation using k-means (n = 7 clusters), (**b**) intensity map of ion *m*/*z* 848.636 (the up-regulated ion in C2 versus C6 with the highest contrasting parameter extracted from the null concentration parameter) and (**c**) intensity map of ion *m*/*z* 198.076, the down-regulated ion with the highest contrast parameter after comparing C2 and C6, extracted from the volcano plot. The highlighted areas in (**b**,**c**) represent C2 (white contour) and C6 (red contour). (**d**) Mean spectrum (red), spectra from C2 pixels (green), and spectra from C6 pixels (pink) near *m*/*z* 848.636 and *m*/*z* 198.076. The spectra show the up-regulated and down-regulated behaviour of the ions. See also the optical image of the same brain tissue section stained with a Hematoxilyn in Figure S4.

#### 2.1.2. Absolutely Up and Down-Regulated Ions in Brain

According to the results in Figure 2b, there are 11 absolutely up-regulated ions in C2, and 34 absolutely down-regulated ions in C3. Figure 4 shows the concentration intensity plot of the two up-regulated ions (*m*/*z* 835.656 and *m*/*z* 806.633) in C2, and Figure 5 shows two down-regulated ions (*m*/*z* 868.459 and *m*/*z* 853.471) in C3 with the highest contrast parameter. There is an evident similarity between the images of the two up-regulated ions for one hand and two down-regulated ones for the other one. A comparison of the images in Figure 4 with the distribution of C2 in the brain are clearly similar. And the same is true of a comparison of the images in Figure 5 with the distribution of C3 in the brain.

**Figure 4.** Concentration images of the two absolutely up-regulated ions in C2. (**a**) *m*/*z* 835.656; (**b**) *m*/*z* 806.633.

**Figure 5.** Concentration images of two absolutely down-regulated ions in C3. (**a**) *m*/*z* 868.458; (**b**) *m*/*z* 853.471.

#### *2.2. Results of the Liver Samples*

The methodology used in this article has been applied to the study of non-alcoholic fatty liver disease in mice exposed to thirdhand tobacco smoke (TBS) [23]. We have taken a total of six images from the liver samples (three from a control mouse and three from a THS-exposed mouse). The images has been segmented using the k-means algorithm (n = 6 clusters). The results of rMSIKeyIon algorithm showed that cluster 2 (C2) has an elevated number of ions in the lipid mass range that are absolutely up-regulated, and we hypothesized that this cluster represents the lipid droplet areas characteristic of the fatty livers (see Figure 6) and the full segmented image (see Figure S5). The THS exposed mouse has the largest area, while the control animals have the smallest, in accordance with Martins-Green et al. [23]. In addition, the Figure S6 is an optical image of a selected area of a tissue section of a control and a THS exposed mouse stained with an Oil Red O protocol. It can also be observed the higher density of lipid droplets in the THS exposed sample.

**Figure 6.** Representation of cluster 2 of the six liver samples: (**a**) the three analytical replicates of a control mouse and (**b**) the three replicates of a thirdhand smoke (THS)-exposed mouse.

Table S1 shows the compounds in C2 putatively identified after a manual curation process. As can be observed, most of them are putatively identified as triglycerides or phosphatidylcholine. In Figure S6, there is the intensity map of the triacylglycerol (50:30), which is highly similar to the geometry of C2.

#### **3. Discussion**

Here, we developed a new methodology for the untargeted analysis of MS images that can be used coupled with any segmentation process and an ion filtering algorithm based on the combination of three parameters: (a) The ratio of ions with a null concentration between the regions, (b) the U Mann–Whitney U Test, calculated by segregating the non-detected ions from the distribution, and (c) the FC between the medians of the distribution (the non-detected ions were also segregated from the distribution). This methodology has proved to be efficient at finding the up/down-expressed ions in an intra-image analysis or in the comparative analysis of groups of images. The presented workflow is different to previously released software tools due to two main reasons: (a) it is flexible and independent to the segmentation process, so the ion selection process can be applied to any clustering algorithm or manually drawn ROIs. (b) Our methodology provides a completely automated ion filtering approach enabling the fast detection of a morphological region characteristic ions.

The results on the sagittal mouse brain sample show that an unsupervised clustering process followed by the rMSIKeyIon algorithm is able to select the (possible) up/down-regulated ions between any pair of clusters, in a holistic approach, and between one cluster and the rest. The concentration maps of the selected ions, ordered by the contrast parameter, depicts faithfully the morphology of the brain. These ions are probably biologically relevant and could be interesting to identify.

Using the described methodology, we have been able to detect the regions containing the lipid droplets in the liver samples from mouse exposed to THS. The putative identification of the key up-regulated ions in the cluster 2, mainly triglycerides and phosphatidylcholines, confirm that THS exposure conducts to the apparition of fatty liver disease in mice [23].

Untargeted metabolomics data analysis workflows are associated to standard analytical platforms (LC-MS, GC-MS, and NMR) [24]. These analyses compare the concentrations of chemical compounds in a CASE and a CONTROL group in order to discover features that they express differently and which could be used as biomarkers or in biological pathway analysis. In general, the number of samples (n) of each experimental group are similar, the distribution is normal (for large n values), and the principle of independent measures is assumed. However, in spatial metabolomics, the number of samples in every group (i.e., the number of pixels in an ROI) is not determined a priori, as in metabolomics studies.

Untargeted image analysis has two main applications:

(a) The comparison of two regions inside the same tissue section (intra-image analysis) to find the relevant ions. This could be used to discover cancer biomarkers by comparing the ion profile of the tumorous area with a non-tumorous area from the same sample. In general, the areas to be compared are determined by a histopathologist annotating a consecutive tissue section. The size of the ROIs in which we will compare the ions is determined manually.

(b) For several reasons, the analysis of morphologically equivalent regions in different tissues in a case-control experiment is much more complicated. First of all, the tissue samples to be compared between groups are equivalent but not similar because of the biological differences between the animals and the intrinsic difficulty of achieving identical tissue sections. Consequently, it is not straightforward to delimit the areas to be compared. The ROIs to be compared can be determined by histological annotation (supervised process), or automatically by means of a segmentation process (unsupervised process). In both cases, there are not established rules, and the following steps in the statistical analysis of the ions between ROIs can be highly affected by this fact.

In both cases, it is very common to find skewed ion distributions and a high percentage of null values, a high degree of autocorrelation between pixels, and a very high number of observations (pixels). This leads to extremely low *p*-values when classical parametric or non-parametric statistical tests are used [25], so these tests are not appropriate for this kind of analysis. For all the above reasons, the untargeted analysis of images remains a challenge. However, the results shown by rMSIKeyIon R package have been revealed to be very useful to find the most differential ions between ROIs. The biological relevance of these ions has been validated in a fatty liver study with animal models.

#### **4. Materials and Methods**

#### *4.1. Materials*

Indium tin oxide (ITO)-coated glass slides were obtained from Bruker Daltonics (Bremen, Germany). The gold target used for sputtering coating was obtained from Kurt J. Lesker Company (Hastings, England) with a purity grade higher than 99.995%. HPLC grade xylene was supplied by Sigma–Aldrich (Steinheim, Germany), and ethanol (96% purity) was supplied by Scharlau (Sentmenat, Spain).

#### *4.2. Methods*

#### 4.2.1. Sample Preparation

Mice models were developed at the Department of Molecular, Cell, and Systems Biology at the University of California Riverside [23]. Animal experimental protocols were approved by the University of California, Riverside, Institutional Animal Care and Use Committee (IACUC). The animal use protocol is A3400-01. The suitability of the workflow presented here to determine significant ions between ROIs from the same tissue was tested in a brain sample from a 6-month-old C57BL/6 mouse feed with a standard chow diet (percent calories: 58% carbohydrates, 28.5% protein, and 13.5% fat). To test the suitability of the method in different tissue sections in a case versus control experiment, we used liver samples from mice exposed to THS—the residual particles and gases from tobacco smoke that remain in dust and surfaces—from weaning (three weeks of age) to 24 weeks, without exposure to secondhand smoke (SHS) at any time during the study, and compared them with liver samples of mice that had not been exposed to THS (control group) [26]. Brain and liver samples were snap frozen at −80 ◦C after collection and stored and shipped at this temperature until analysis.

For MSI acquisition, the tissues were sectioned at −20 ◦C in slices 10 μm thick using a Leica CM-1950 cryostat (Leica Biosystems, Nussloch, Germany) located at the Centre for Omics Sciences (COS) of the Rovira i Virgili University and mounted on ITO slides by directly placing the glass slide onto the section at ambient temperature. To remove residual humidity, samples were dried in a desiccator under vacuum for 15 min after tissue mounting.

#### 4.2.2. Deposition of Au Nanolayers for LDI-MS Imaging

Gold nanolayers were deposited on the 10 μm tissue sections using an ATC Orion 8-HV sputtering system (AJA International, N. Scituate, MA, USA) [27]. Briefly, an argon atmosphere with a pressure of 30 mTor was used to create the plasma in the gun. The working distance of the plate was set to 35 mm. Sputtering conditions for MS were ambient temperature, and RF mode at 60 W for 50 s. The argon ion current was adjusted to 20 mL min <sup>−</sup>1.

#### 4.2.3. LDI-MS Acquisition

One image of a sagittal brain tissue section and six liver tissue sections (three slices from a control animal and three sections from a THS-exposed animal) were acquired using a MALDI TOF/TOF UltrafleXtreme instrument with SmartBeam II Nd:YAG/355 nm laser from Bruker Daltonics, also at the COS facilities. Raster sizes of 80 and 20 μm were used for the brain and liver tissue sections, respectively. The TOF spectrometer operated in reflectron positive mode with the digitizer set at a sample rate of 1.25 GHz in a mass range between 70 and 1200 Da. The spectrometer was calibrated prior to tissue image acquisitions using [Au]<sup>+</sup> cluster MS peaks as internal mass references [27].

#### 4.2.4. MSI Data Processing and Image Segmentation

The MSI data acquired with Bruker's FlexImaging 3.0 software was exported to XMASS data format using instrument manufacturer software packages (FlexImaging and Compass export). The raw data was loaded using the in-house rMSI package [28]. This package provides a data storage format based on segmented matrices and optimized for processing large MSI datasets in R language. Next, we applied our complete MSI pre-processing workflow consisting of spectral smoothing, alignment, mass recalibration, peak detection and peak binning [29] with the default parameters: Savitzky–Golay kernel size of 7, peak detection threshold SNR of 5, and peak binning tolerance of 6 scans with 5% filter. At this point, we obtained a peak matrix object of each MSI dataset: the brain tissue sagittal section and the liver tissue sections. These peak matrix objects are highly reduced, robust, and accurate representations of all the MSI data and can be used to perform complex statistical analyses on the huge amount of data generated in the MSI experiment. ROIs were generated by means of a k-means process. Finally, we applied the rMSIKeyIon workflow using the peak matrices as the input data.

#### 4.2.5. Ion Analysis and Filtering

The procedure used for identifying statistically different ions compared the concentration distributions of the ions in all possible pairs of ROIs in which the tissue (or tissues) had been segmented.

In general, the total number of pixels in each ROI is different and the probability density function of the ion concentrations is not normal. We used the Mann–Whitney U test [22] because it can test the null hypothesis (both sets of samples come from the same distribution) of two non-normal distributions that have a different number of observations.

In addition, in non-normal distributions of different sample sizes, there is usually a singular element: In some ROIs, there is a considerable possibility that the distribution of some ions will have small concentration values. Figure S8 represents the percentage of non-detected ions in the segmented brain image, using the k-means algorithm with n = 7 clusters. It can be observed that for some clusters (i.e., cluster 7) the percentage is very high.

For purposes of illustration, Figure S9 shows a simulated histograms of an ion in two different clusters with samples taken from normal distributions, with different average values, to which significant amounts of null values have been added. In total, there are 200 samples for both cases. Both distributions appear to be very different and the Mann–Whitney U test yields a very high *p*-value (0.38). The idea we have worked on here is to segregate the values obtained from non-detected ions (null values) from the rest of the distribution so that they can be treated separately. Thus, we obtain a very small *p*-value (of the order of 1 <sup>×</sup> 10−43). On the other hand, the percent of null values in each ROI also provides valuable information. For these reasons, we decided to segregate the null values from the ion matrix and use them to calculate a parameter (null concentration parameter), as will be explained below.

The calculation of the null concentration parameter, as well as the non-null parameters (Mann–Whitney U distribution and FC), are described in Appendix A.

Once the ions were selected using the two procedures described above, they were ordered in terms of the contrast generated by every ion between one ROI and the set of other ROIs. The procedure is described in Appendix B.

The ion filtering algorithm described in this section has been implemented in the R package named rMSIKeyIons, accessible at (https://github.com/LlucSF/rMSIKeyIon). The software's source code was written in C++ and requires the GNU Scientific Library (GSL) (https://www.gnu.org/software/gsl). Later, it was ported to R using the Rcpp R package. As input, the function requires an rMSIproc peak matrix, a previously calculated segmentation and the percentiles for each parameter, and as output, the function returns a list containing the ions for each comparison between all pair of clusters and the data related with those ions.

#### 4.2.6. Metabolite Identification

The obtained list of up regulated lipids for mice liver samples in cluster 2 was matched with the HMDB 4.0 [30] database within a tolerance of 20 ppm and the possible ion adducts: H, Na, K, and NH4. Results were filtered using the biological information of molecules provided by the HMDB, thus metabolites with no biological origin or not likely to be found in liver were discarded.

#### **5. Conclusions**

In this study, we developed the ion filtering R package rMSIKeyIon. It is open source, publicly available, and based on the combination of three parameters: the non-detected ion concentration ratio, the Mann–Whitney U ion concentration test, and the FC in the ion concentration. The null values were discarded before computing the last two parameters.

We demonstrated that our tool is very effective at discovering up or down-regulated ions between clusters using an unsupervised k-means procedure. The ions selected are the candidates that, subsequently, have to be identified. This package is a valuable tool for the untargeted analysis of MALDI images and is an important advance in this area because, at present, there are no tools available.

**Supplementary Materials:** The following are available online at http://www.mdpi.com/2218-1989/9/8/162/s1. The brain dataset, the used clustering and a R script containing instructions about the installation and the testing of the package accompanied with a document containing illustrative figures. Also the results of the method are included.

**Author Contributions:** X.C. and E.d.C. designed and conducted the research. M.M.-G. designed the animal model experiments, and generated and collected the mice samples, and M.S. processed the liver and brain samples. P.R. acquired the images and processed the data, N.R. supervised the biological interpretation and S.T. worked on the putative identification of the metabolites in the liver samples. E.d.C. and L.S. programmed the ion filtering

routine software. E.d.C., X.C. and N.R. wrote the article and L.S. was in charge of the illustrations. All the authors revised the manuscript for important intellectual content and read and approved the final manuscript.

**Funding:** This study has been supported by the Spanish Ministry of Economy and Competitiveness through projects TEC2015-69076-P and RTI2018-096061-B-I00, PR's predoctoral grant No. BES-2013-065572 and the General Directorate of Research of the Government of Catalonia through project 2017 SGR 1119. Animal model development was funded by the Tobacco Research Disease Related Program (TRDRP) of the University of California under projects 22RT-0121 and 23DT-0103.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Appendix A Calculation of the Similarity Parameters between ROIs**

In order to determine the ions that are expressed differently in two given ROIs, we calculate three parameters:

(a) The null concentration parameter (Z parameter)

The *Zijk* parameter is calculated according to Equation (A1):

$$Z\_{i\backslash\mathbb{k}} = \frac{\frac{Nz\_{ij}}{N\_j}}{\frac{Nz\_{ik}}{N\_k}} \forall i \in I; \forall j, k \in S\_{p\prime} \tag{A1}$$

where *Zijk* is the parameter that accounts for the null values (i.e., the non-detected values) of the *i* ion when comparing the *j* and *k* ROIs; *Nzij* and *Nzik* are the number of pixels with null values of the *i* ion in *j* and *k* ROIs, respectively; *Nj* and *Nk* are the total number of ROI pixels in *j* and *k*, respectively; *I* is the set of ions and *Sp* is the set of ROIs.

The equation calculates the ratio between the null values of a particular ion in the two ROIs. A value of *Zijk* > *Zhigh* (*Zhigh* being a positive value greater than 1) means that the *i* ion is more expressed in *k* ROI than in *j* ROI, while *Zijk* < *Zlow* (*Zlow* being a positive value much lower than 1) means that the *i* ion is less expressed in *k* ROI than in *j* ROI.

The importance of this parameter is assessed in Figure S7. For clusters 1 to 7, we plotted, the percentage of pixels that have null concentration for every ion.

The *Zhigh* and *Zlow* values are calculated by following these steps:


(b) Non-null concentration parameters (V parameters)

Provided that the distribution of the ions concentration is non-normal, we considered the *U* Mann–Whitney U test and the concentration FC between two ROIs, as a non-null concentration parameters.

Generally speaking, if *Nj* and *Nk* are high, the random variable *U* can be regarded as normally distributed [22]. The *Uijk* parameter is then normalized following Equation (A2):

$$V\_{ijk} = \frac{\mathcal{U}\_{ijk} - m\_u}{\sigma\_u},\tag{A2}$$

where *mu* and σ*<sup>u</sup>* are the average and standard deviation of zero *Uijk* and *Vijk* is a random variable with a normalized Gaussian distribution. If *V* has values close to 1 the similarity between the distributions is high, while values close to zero indicate disparate distributions. The value obtained for *V* indicates the similarity between the distributions of two ROIs for an ion.

Another parameter often used to compare sets of magnitudes is the FC, defined as the ion median concentration quotient between two ROIs Equation (A3):

$$F\mathbf{C}\_{\mathrm{ijk}} = \frac{M\_{\mathrm{ij}}}{M\_{\mathrm{ik}}} \, \mathrm{} \tag{A3}$$

where *Mij* is the distribution median of the *i* ion in *j* ROI and *Mik* is the same for *k* ROI. For every *i* ion, the *FCijk* parameter is calculated between the *j* and *k* ROIs. For a pair of ROIs, a Volcano plot [31] can be drawn from the *V* and *FC* parameters.

In this representation, the position occupied by the ions is important: the ions located in the top corners generate very different distributions in the two ROIs. The ions at the top left are under-expressed (*Vijk* < *Vhigh* ∧ *Fcijk* < *Fclow*) and the ions at the top right are over-expressed (*Vijk*< *Vhigh* ∧ *Fcijk* >*Fchigh*).

The values *Vhigh*, *Fchigh* and *Fclow* are calculated following the same steps as for *Zhigh* and *Zlow*, but with a difference in the percentile value. The ions located in the areas of interest must satisfy the probability of being within a range associated with two random variables; that is to say:

*P* - *Vijk* <sup>≤</sup> *Vhigh*, *Fcijk* <sup>≤</sup> *Fclow* for under-expressed ions and *<sup>P</sup>* - *Vijk* <sup>≤</sup> *Vhigh*, *Fcijk* <sup>≥</sup> *Fchigh* for over-expressed ions. Assuming that these are independent random variables, we obtain *P* - *Vijk* <sup>≤</sup> *Vhigh* = *P* - *Fcijk* <sup>≤</sup> *Fclow* = *P* - *Fcijk* <sup>≥</sup> *Fchigh* <sup>=</sup> <sup>√</sup> *Pz*/100. That is, the percentile that has to be used to determine the cutoff values in the volcano plot should be *PV* = 10· √ *PZ*

#### **Appendix B Determination of the Discriminating Figure Values and Generation of the Discriminant Ions Lists**

The contrast parameter *Cij*∨*Sp* of the *i* ion between the *j* ROI and all the ROIs (set *Sp* is calculated according to Equation (A4)):

$$\mathbb{C}\_{ij \vee S\_p} = \frac{\frac{1}{N\_j} \sum\_{p=1}^{N\_j} m\_{ip}^j}{\frac{1}{N} \sum\_{k=0}^{N\_{S\_p}} \sum\_{p=1}^{N\_k} m\_{ip}^k} \, \tag{A4}$$

where *N* is the total number of pixels in *Sp*, *Nj* and *Nk* are the number of pixels in the *j* and *k* ROIs respectively. *NSp* is the total number of ROIs in set *Sp*, *<sup>m</sup><sup>j</sup> ip* and *<sup>m</sup><sup>k</sup> ip* are the magnitude of the *i* ion in pixel *p* of the *j* and *k* ROI, respectively. The list is ordered according to the *Cij*∨*Sp* , assuming that high values mean high contrast and vice-versa.

#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

MDPI St. Alban-Anlage 66 4052 Basel Switzerland Tel. +41 61 683 77 34 Fax +41 61 302 89 18 www.mdpi.com

*Metabolites* Editorial Office E-mail: metabolites@mdpi.com www.mdpi.com/journal/metabolites

MDPI St. Alban-Anlage 66 4052 Basel Switzerland

Tel: +41 61 683 77 34 Fax: +41 61 302 89 18