1. Introduction
High-performance liquid chromatography in the reverse-phase separation mode (RP-HPLC), accounts for more than 90% of separations in modern analytical laboratories [
1]. Prediction of LC retention time has become valuable, powerful, and routine in chromatographic method development. Depending on the experimental design, researchers may be directly interested in retention data, or may use them to infer additional information. The quantitative structure-retention relationships (QSRR) model provides significant additional insight into the relationship between the molecular structure and fundamental phenomena in chemistry.
QSRRs can be useful in a variety of applications, such as identification of the most useful structural descriptors that describe the retention mechanism, prediction of retention time of new analytes, and the identification of unknown analytes. It can also be used for the quantitative comparison of separation properties of different chromatographic columns; for evaluation of physical properties, such as lipophilicity or dissociation constants; as well as estimation of relative bioactivities of xenobiotics [
2].
An applicable retention time prediction model not only needs to be predictive, but also needs to be able to predict analytes in the correct order. Suppose one has two close chromatographic peaks 0.1 min apart, as schematically depicted in
Figure 1. For instance, 5.1 min for analyte A, and 5.2 min for analyte B. The following three cases yield identical retention time errors, but different elution order errors: predicted retention times: (i)
tR,A = 5.0 min,
tR,B = 5.1 min; (ii)
tR,A = 5.2 min,
tR,B = 5.3 min; and (iii)
tR,A = 5.2 min,
tR,B = 5.1 min. In the aforementioned example, cases (i) and (ii) still remain satisfactory as the order has not shifted. The third case, however, involves a shift in the elution order despite the identical error in retention order. Such a simple example already emphasizes the problems that this work aims to solve.
This becomes especially evident in the case of very complex mixtures with hundreds of compounds (such as peptides in a proteomic mixture) whereby the peak capacity greatly increases. In RP-HPLC, solutes are eluted in order of decreasing polarity. However, it is often not as straightforward, and a retention time prediction model utilizing hydrophobicity/lipophilicity as a predictor is often insufficient to provide a clear picture of the retention mechanism. Thereby, even though the model may predict the retention time with a reasonable error, the predicted order of analytes for complex analytical mixtures containing thousands of close or even overlapping peaks is often very poor. Approaches to solve this complex chromatographic problem in RP-LC are quite rare, and outlined in a few studies in the literature [
3,
4,
5,
6]. In a QSRR study involving solvatic sorption parameters [
7] for prediction of retention of phenylisothiocyanate derivatives of 25 natural amino acids, Vorslova et al. [
3] report a retention error of <6%. While the predicted elution order mostly concurred with the experimental one, for retention times >15 min the QSRR model yielded larger deviations, with a few unresolved peaks.
Another example of elution order prediction is in a work of Shinoda et al [
4]. The authors utilized artificial neural networks (ANNs) to predict retention times of peptides (<50 amino acids) and report a reasonable model for 834 peptides (with
R2 of 0.928). Although not the first time the use of ANNs have been reported in retention time prediction in proteomics [
8,
9], Shinoda et al. [
4] are the first to report the prediction of the elution order with an error of elution order <11%. The method itself does not integrate the elution order prediction directly into the modelling process and it is rather vaguely described.
Recently, a complex methodology was presented by Bach et al. [
5] for elution order prediction in metabolomics using support vector machines [
10] and dynamic programming [
11]. The approach was based on molecular fingerprints of two molecules as the input and the elution order as the output. Although quite promising, the method is computationally expensive, whereas the predictions of the elution order are sensitive to the number of training samples and their composition [
5].
In this work, we define the elution order prediction as a multi-objective optimization problem with two objective functions: percentage root mean square error (%RMSE) of retention time and that of elution order. It is directly implemented within the QSRR modelling process, where regression coefficients are determined through multi-objective optimization (MOO) employing the two objectives, root mean square errors of retention time and elution order. MOO was used to obtain QSRR models which compromise prediction errors in favor of the enhanced elution order prediction. The developed method was applied to analysis of two mixtures: (i) separation of organic molecules on Supelcosil LC-18 column, and (ii) separation of peptides in seven columns under varying chromatographic conditions.
2. Results and Discussion
This work presents a method for the prediction of the elution order in RP-LC based on QSRRs defined as an MOO problem. Two case studies have been evaluated in eight RP-LC columns and under various experimental conditions. Since the current study represents a proof-of-concept for preliminary evaluation of our method, data for two case studies were taken from two literature sources. Namely, the works of Kaliszan et al. [
12] and Bączek et al [
13]. Case study 1 (CS1) was represented by a single chromatographic column (Supelcosil LC-18). Case study 2 (CS2) on the other hand was represented with multiple columns and gradients. Inclusion of all the chromatographic columns into our conceptual work was to exhibit the versatility of the approach and to emphasize the differences in elution order predictability.
In both case studies, for all the RP-LC columns, sacrificing retention time predictive ability (
Figure 2A) resulted in a considerable increase in prediction of elution order (
Figure 2B). As evident from
Figure 2C and
Table 1, the maximum relative increase in %RMSE(
tR) is a little over 15%, whereas there is up to > 80% decrease in %RMSE(order). The molecular descriptors, experimental and predicted retention times for both MLR and MLR-MOO are summarized in
Tables S2 and S3, respectively.
Table 2 summarizes the physicochemical parameters of the investigated columns. Namely, column length, internal diameter, particle size, carbon load, pore size, and surface area. When it comes to predictability of selectivity, and by extension the elution order, column length and internal diameter are very important parameters. When the dwell volume is high, a low flow rate makes it more difficult to predict elution order, because of delayed gradient elution. Larger internal diameter leads to higher flow rates, whereas a decrease in the internal diameter is favorable for faster analyses but results in an increase in back pressure which may also adversely affect the predictability of the elution order. Column length positively affects analysis time, while decreasing resolution. For higher column lengths, elution order may be more predictable. However, elution order predictability is more crucial to columns with lower lengths due to decreased peak spacing.
From
Table 2 it can be observed that most of the columns have the same internal diameter (4.6 mm) except PLRP-S with a value of 4.1 mm. On the other hand, all the columns are reasonably long (>150–200 mm), apart from Chromolith and LiCrospher CN which have a length of 100, and 125 mm, respectively. For LiCrospher CN, this fact is reflected in considerably higher error in elution order for the control MLR model (%RMSE(order) = 195.82%). Columns with other lengths exhibit various %RMSE(order) values, with some as high as 241.63% for Discovery RP Amide C16 with a 20-min gradient and a temperature of 80 °C. These kinds of inconsistencies may have occurred because of the missing dwell volume information not reported in the works of Kaliszan et al. [
12] and Bączek et al. [
13] from which the retention data were obtained. In QSRR modelling, physicochemical parameters of the column and the column itself are kept constant and it is assumed that retention depends mostly on the molecular structure of the analytes [
14,
15]. Lack of generalizability and dependence on a pre-defined set of chromatographic conditions is not only the main limitation of QSRR, but of QSAR and QSPR. However, such dependence diminishes the influence of physicochemical parameters of the column on prediction of retention time and order.
Out of the eight RP-LC columns evaluated, results of two representative columns for each case study (first case study—CS1: Supelcosil LC-18,
tG = 10 min,
T = 35 °C; and second case study—CS2: Xterra MS C18,
tG = 20 min,
T = 40 °C) are summarized in more detail as an example. For the MLR-MOO models, the optimal solution (solution 19, %RMSE(
tR) = 8.670%, %RMSE(order) = 43.679%) for CS1 (Supelcosil LC-18 column) is shown in
Figure 3A, while the optimal solution (solution 16, %RMSE(
tR) = 11.631%, %RMSE(order) = 19.820%) for one of the representative columns of the second case study (Xterra MS C18 column) is shown in
Figure 3B. As mentioned in one of the subsequent sections (
Section 3.7), it is up to the user to set an upper bound on the loss of retention predictive ability. For instance, for CS1 involving the Supelcosil LC-18 column the limit was set to ~10% from the knee point value. For the Supelcosil LC-18 column the increase in %RMSE(
tR) was 8.67, whereas the decrease in %RMSE(order) was 4.43%. One may argue that the point with the lowest %RMSE(
tR) is the optimal one (
Figure 3A). However, if indeed deemed optimal, the point decreases the %RMSE(order) by 14.94, while at the same time increases the %RMSE(
tR) by a sizeable 19.23%. Similarly, the MOO Pareto front was analyzed for the Xterra MS C-18 (
Figure 3B) column of CS1 where the limit was set to ~5%. Consequently, for an increase of %RMSE(
tR) by 3.65, a decrease in the %RMSE(order) of 4.82% was achieved.
We reiterate that setting an upper bound on the increase of %RMSE(
tR) is a user-defined parameter, and in such a case there is “no free lunch” [
16].
Predictive ability in terms of both retention time and elution order and the corresponding (analytical chemical) domains of applicability for the optimal MLR(-MOO) QSRR models for the representative columns of CS1 and CS2 for MLR(control) and MLR-MOO calculations are depicted in
Figure S1 and
Figure 4, respectively. Reasonable performance is exhibited both for CS1 (
Figure 4A–C), and CS2 (
Figure 4D–F).
Nearly all the analytes within both case studies are within their respective applicability domains (
Figure 4C,F) with only one structurally important analyte which is predicted well and was included in the training set. Thereby, the developed models are deemed stable and robust to structurally more distant analytes. Performance of all the columns in CS2 involving the separation of synthetic peptides in terms of elution order predictions was ranked using SRD analysis.
Figure 5 depicts the ranking values for all the columns in CS2. Validation was performed by computing SRD values for normally distributed random numbers. It can be observed that the ranking of all the QSRR models of the second study is statistically significant because their respective SRD values are on the far-right side of the Gaussian curve. A few columns stand out, especially the Discovery RP Amide C16, in all the combinations of gradients and column temperatures, which exhibited that errors in the elution order are up to 70% even after MOO. It is then not surprising that they are ranked among the last. Even in the original work of Bączek et al. [
13] from which the data have been obtained the models in question performed poorly. It is worth noting that these columns are highly polar and their respective polar intramolecular interactions with the analytes are difficult to quantify [
13]. Therefore, the current QSRR model was not able to capture the retention mechanism in its entirety. Nevertheless, our method has still decreased the %RMSE(order) approx. three-fold, from ~200% to ~70%.
As for the other columns of the second case study, there are a few interesting examples. For instance, for the PLRP-S column, %RMSE(
tR) systematically increases as the gradient decreases from 60 to 20. On the other hand, for the LiChrospher RP-18 column it is the opposite. %RMSE(
tR) for this column increases with the increase of the total gradient time. The increases in the errors may be due to the absence of isomeric peptides, so the QSRR models have not accounted for the proximity effects of two or more identical amino acids in the peptides’ primary sequence. Retention of the peptides, such as the “AA” peptide, do not necessarily correlate strongly with a simple sum of their respective gradient time “basis sets”. The linear model applied in this work may not be sufficient to account for this behavior. In both cases the analytes exhibiting lower retention times exhibit a degree of non-linearity between the parameters encoding the molecular structure of the peptides and retention. More complex machine learning methods should be applied to successfully capture these complex, often non-linear relationships. The somewhat inconsistent results are also due to the fact that the gradients need to be carefully optimized for a particular separation [
13]. On the other hand, the implication of this conclusion is that both QSRRs and our elution order prediction method can assist in localizing optimal conditions in RP-LC method development.
Finally, the developed methodology has great potential for practical applications once it is comprehensively validated on new high-quality retention data of complex mixtures, such as proteomics and metabolomics mixtures. We believe that not only retention time predictions, but also predictions of the elution order have the potential to greatly facilitate peptide/metabolite identification. This can be achieved by comparing classification rates (expressed as false detection rate) before and after implementation of elution order prediction within an RP-LC-MS/MS proteomics or metabolomics workflows and is an integral segment of our future work.
3. Materials and Methods
3.1. Chromatographic Measurements
Detailed information about measurements of analysis on high-performance liquid chromatography (HPLC) are shown in
Table S1 and previous studies [
13,
17]. Briefly, all the chromatographic measurements were performed with a flow rate of 1 mL/min, and UV detection wavelengths at 214 and 223 nm. The injected sample volume was 20 μL. The mobile phase for case study 1 was methanol and 100 mM tris buffer at pH values of 2.5 and 7.2. For case study 2, gradient elution was carried out with solvent A (water with 0.12% trifluoroacetic acid) and solvent B (acetonitrile with 0.10% trifluoroacetic acid). The mobile phase was filtered through a GF/F glass microfiber filter (Whatman, Maidstone, UK) and subsequently degassed with helium during the analysis. Peptide samples were dissolved in water containing 0.10% (
v/
v) of trifluoroacetic acid.
3.2. QSRR Model Development
In this work, two QSRR models were developed, for two case studies involving simple analytical mixtures: (i) separation of organic molecules in a Supelcosil LC-18 column, and (ii) separation of peptides in seven columns under varying chromatographic conditions. In the first case study, retention times of 62 organic compounds analyzed in the Supelcosil LC-18 column (
Table S1) were utilized. The following retention time prediction model was used:
where
μ represents the total dipole moment,
δmin is the electron excess charge of the most negatively-charged atom, whereas
SASA is the solvent accessible surface area. The model defined with Equation (1) was used because it is one of the earliest, and well-described mechanistic QSRR models derived purely in silico introduced by Kaliszan et al [
12,
18]. The descriptors
μ and
δmin accounted for electrostatics. Namely, dipole-dipole and dipole-induced dipole interactions of the analyte with the mobile and stationary phases; and local polar interactions, respectively. On the other hand, SASA accounted for dispersive interactions of the analyte and the mobile and stationary phase [
12].
The dataset itself was obtained from literature [
17], while the molecular descriptors were re-calculated using density functional theory (DFT) at a higher level of theory, namely, MN15/6-311+G** [
19,
20]. Due to a pronounced solvent effect, the implicit SMD solvation model [
21] was used to model it. All the DFT calculations were performed in Gaussian 16 software (Gaussian, Inc., Wallingford, CT, USA).
The second case study comprised of retention times of 98 synthetic peptides analysed on seven different columns in varying conditions: Xterra MS C18, LiChrospher RP-18, LiChrospher CN, Discovery HS F5-3, Discovery RPAmide C16, PLRP-S and Chromolith (
Table 2 and
Table S1). The QSRR formulation used in the second case study is as follows:
where log
SumAA is logarithm of the sum of gradient retention times of the amino acids composing the peptide,
clog
P is the logarithm of its theoretically calculated
n-octanol−water partition coefficient representing hydrophobicity of the peptides, whereas log
vdWvol. is the logarithm of the peptides’ van der Waals volume. The log
SumAA descriptor, thereby, accounted for the primary structure of the peptides, whereas the
clog
P and
vdWvol explained most of the remaining variance in retention due to post-translational modification and acetylation [
22,
23,
24] Both the retention data and the molecular descriptors were obtained from literature [
13]. In the second case study, due to the sheer size of the molecular structures of most of the analytes, the descriptors were not re-calculated at a higher level of theory.
For a functional form for Equations (1) and (2), a linear equation is employed:
where
xi and
αi denote the molecular descriptors and regression coefficients, respectively, while
ε is the error. The regression coefficients are typically estimated using multiple linear regression (MLR) [
25].
3.3. QSRR Model Validation
In QSRR, model validation is the task of demonstrating that the model is a reasonable representation of the actual system; i.e., it reproduces system behavior with enough reliability to satisfy prediction objectives. In this work, both datasets were separated into training and external validation sets using the Kennard and Stone algorithm [
26] (70/30% ratio). Performance metrics, such as %RMSE, were evaluated and the predictive ability of the developed models was also depicted.
Chemical domains of applicability were also defined for all the models in both case studies. This was to ensure that the predictions were restricted to prediction of a model to compounds which possessed similar structural, physicochemical, or biological space information similar to the training compounds. For this purpose, the Williams plot, a graphical description of dependence between standardized residuals and leverages of the model was employed. Its warning limits were set as the critical leverage value
h* and three multiples of standard deviation. The critical leverage [
27] is defined as:
where
K is the number of variables, and
N is the number of observations.
3.4. Elution Order Prediction
As previously mentioned, elution order in HPLC is generally governed by polarity. Thereby, in RP-HPLC, solutes will typically be eluted in a reverse order to nonpolar compounds where retention increases with decreasing polarity. However, it is not always as straightforward. In this work, the elution order problem was formulated by considering all the peaks. Mathematically, elution order can be easily predicted once retention time is predicted and then sorted in ascending order. Specifically, indices are defined for all the experimental peaks (sorted in ascending order). Subsequently, retention times predicted by a QSRR model are sorted with respect to the experimental times and another set of indices is defined and sorted. The resulting differences between the two sets of indices define the elution order error. Therefore, it is directly implemented within the QSRR modelling process.
3.5. Multi-Objective Optimization (MOO)
Multi-objective optimization (MOO) seeks to optimize a vector-valued cost function, with more than one objective. In this work, the MOO formulation of elution order prediction was defined as:
where
α* represent optimal regression coefficients. %RMSE(
tR) and %RMSE(order) denote %RMSE of retention time and that of elution order. Upon obtaining the MLR model (control), the MLR coefficients were used as an initial point for multi-objective optimization (MLR-MOO). In this work MOO is implemented using genetic algorithms (GA) [
28,
29]. GAs are a family of optimization algorithms based on natural evolution and were chosen as a robust alternative to the classical interior-point [
30] or conjugate gradient [
31] algorithms.
The solution to the MOO problem is not a single point, but a family of points (Pareto front). Each point on the Pareto front that does not lead to degradation is optimal in the sense that no improvement can be achieved in one vector component of the cost with respect to one of the remaining components. From the obtained Pareto front(s), a solution is selected which gives a desirable trade-off between the objectives, in this case decrease of elution order error at the expense of retention time prediction. It has to be noted that the choice of an upper bound for the loss of the retention predictive ability (expressed as 100 − %RMSE) is a user-defined parameter.
3.6. Objective Functions for MOO
For MOO, %RMSE of retention time and elution order were used as objective functions for elution order prediction through MOO. %RMSE(
tR) [
15,
32] was defined as follows:
In Equation (6) tR,pred and tR,obs. stand for the predicted and experimentally-obtained retention times, respectively, while n is the number of observations. Once the predicted retention times are sorted in ascending order with respect to the experimental ones, the elution order can be easily predicted. Subsequently, %RMSE(order) had the same form as the equation for %RMSE(tR) with the experimental and predicted elution order instead of the retention time.
3.7. Selection of an Optimal MOO Solution
An optimal solution from the obtained Pareto set of solutions is selected employing the following approach. Starting from the knee point of the Pareto front, a point which represents the optimal trade-off between errors in retention time and elution order is selected. This is because the estimated Pareto front knee point may not always represent an acceptable solution to the end-user. Thereby, the optimal MOO solution can be considered as a user-defined parameter. Subsequently, the optimal solutions are compared to the control QSRR models built via MLR. Diagnostic metrics: (i) predictive performance, and (ii) (analytical chemical) domain of applicability, are used to evaluate the developed QSRR and QSRR-MOO models.
3.8. Sum of Ranking Differences
An attempt was made to utilize sum of ranking differences (SRD) [
33] for an objective selection of a solution from the Pareto front. However, for all the columns the differences were too subtle. Instead, SRD was used to directly compare the performance of all the RP-LC columns of CS2 in terms of their elution order.
SRD is a method that has been used on several occasions for comparison of chromatographic columns both in terms of column performance and QSRR model performance [
34,
35]. Essentially, SRD is an unbiased ranking method with the ability to compare models, methods, analytical techniques, and so on. It is based on the sum of squared differences in the ranking of the observed objects (e.g., chromatographic columns) and their respective observations (e.g., analytes). The ranking is obtained with respect to either a golden standard or a global statistical metric (such as mean, or median) of the observations. In this work, an extension of the original SRD method, which includes validation through a comparison of ranks of random numbers (CRNNs) has been employed [
36]. SRD-CRNNs are based on computing SRD values for a series of normally-distributed random numbers. Values of the first icosaile, median, and last icosaile are computed, as well as a sufficient number of points to plot the SRD normal distribution curve. If an object has an SRD value on either side of the curve, it is statistically significantly different than the ranking of random numbers.
3.9. Software Development
Implementation of both the GA in its multi-objective form, objective function, knee point determination, and comprehensive QSRR model validation was carried out using MATLAB 2019a (MathWorks, Sherborn, MA, USA). A graphical interface was constructed for the purpose of easing user interaction with the developed software.