#### *2.1. Derivative-Free Optimization*

We consider the problem of optimizing a function *f* : X → R, where X ⊆ R<sup>*D*</sup> is the search space. In this paper, we focus on minimization, namely:

$$\mathbf{x}^\* = \arg\min\_{\mathbf{x} \in \mathcal{X}} f(\mathbf{x}; \mathcal{D}), \tag{1}$$

where D denotes observed data.

Further, we assume that the analytical form of the function *f* is unknown or cannot be used to calculate derivatives; however, we can query the function through a simulation or experimental measurements. Problems of this sort are known as *derivative-free* or *black-box* optimization problems [5,17]. (Strictly speaking, a *black-box* problem is one for which a formal description is unknown; in practice, however, non-differentiable problems with a known mathematical representation, e.g., differential equations, are often treated as black-box as well.) Additionally, we consider a bounded search space, i.e., we impose inequality constraints in every dimension of the following form: *l<sub>d</sub>* ≤ *x<sub>d</sub>* ≤ *u<sub>d</sub>*, where *l<sub>d</sub>*, *u<sub>d</sub>* ∈ R and *l<sub>d</sub>* < *u<sub>d</sub>*, for *d* = 1, 2, . . . , *D*.
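To make the setting concrete, the sketch below shows how such a problem is typically represented in code: a black-box objective that can only be queried point-wise, together with box constraints. The function `f`, the bounds, and the helper `clip_to_bounds` are illustrative assumptions (a simple shifted sphere function stands in for a simulator).

```python
import numpy as np

# Hypothetical stand-in for a black-box simulator: we may only query f(x),
# never its gradient. A shifted sphere function plays that role here.
def f(x):
    return float(np.sum((x - 0.5) ** 2))

D = 3                      # dimensionality of the search space
l = np.zeros(D)            # lower bounds l_d
u = np.ones(D)             # upper bounds u_d

def clip_to_bounds(x, l, u):
    """Project a candidate back into the box [l_1, u_1] x ... x [l_D, u_D]."""
    return np.minimum(np.maximum(x, l), u)

x = clip_to_bounds(np.array([1.3, 0.2, -0.4]), l, u)
```

Projection (clipping) is one simple way to honor the bound constraints; any candidate proposed outside the box is mapped back onto its boundary.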

#### *2.2. Population-Based Optimization Methods*

One group of widely used methods for derivative-free optimization problems is population-based optimization algorithms. The idea behind these methods is to iteratively operate on a *population* of *individuals*, i.e., a collection of candidate solutions X = {**x**<sub>1</sub>, . . . , **x**<sub>*N*</sub>}, instead of a single candidate solution. The premise of utilizing a population is better exploration of the search space and better exploitation of potential local optima [18,19].

In essence, every population-based algorithm consists of the following steps, built around a procedure *G* for generating new individuals and a selection procedure *S*:

**(Init)** Initialize X = {**x**<sub>1</sub>, . . . , **x**<sub>*N*</sub>} and evaluate all individuals, F<sub>X</sub> = { *f<sub>n</sub>* : *f<sub>n</sub>* = *f*(**x**<sub>*n*</sub>), **x**<sub>*n*</sub> ∈ X }.

**(Generation)** Generate new candidate solutions using the current population, C = *G*(X, F<sub>X</sub>).

**(Evaluation)** Evaluate all candidate solutions:

$$\mathcal{F}\_{\mathcal{C}} = \{ f\_n : f\_n = f(\mathbf{x}\_n), \ \mathbf{x}\_n \in \mathcal{C} \}.$$

**(Selection)** Select a new population using the candidate solutions and the old population

$$\mathcal{X} := \mathcal{S}(\mathcal{X}, \mathcal{F}\_{\mathcal{X}}, \mathcal{C}, \mathcal{F}\_{\mathcal{C}}).$$

Go back to **(Generation)** or terminate.
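The Init–Generation–Evaluation–Selection loop above can be sketched as follows. The concrete choices of *G* (Gaussian perturbation) and *S* ((*μ* + *λ*) truncation) are illustrative placeholders, and a sphere objective stands in for the black-box simulator.

```python
import numpy as np

def f(x):                                   # hypothetical black-box objective
    return float(np.sum(x ** 2))

rng = np.random.default_rng(0)
N, D = 20, 2

X = rng.uniform(-5.0, 5.0, size=(N, D))     # (Init) population
Fx = np.array([f(x) for x in X])            # evaluate all individuals

for generation in range(50):
    C = X + 0.5 * rng.standard_normal((N, D))   # (Generation) a simple G(X, Fx)
    Fc = np.array([f(x) for x in C])            # (Evaluation)
    # (Selection) S: keep the N best out of the old and new candidates
    P, Fp = np.vstack([X, C]), np.concatenate([Fx, Fc])
    best = np.argsort(Fp)[:N]
    X, Fx = P[best], Fp[best]
```

Because selection keeps the best of old and new individuals, the best fitness in the population never deteriorates across generations.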

An exemplary population-based optimization approach is depicted in Figure 2.

In general, population-based methods are favorable over other derivative-free optimization (DFO) algorithms when querying the objective function is relatively cheap. Their computational complexity depends mainly on the population size, i.e., it is linear with respect to the size of the population *N*. Other DFO methods are typically more expensive; Bayesian optimization, for instance, is known to perform well, but its complexity typically scales cubically with the number of queries [20]. Here, we take advantage of the very low execution time of the simulator (the glycolysis model) and propose to use population-based methods for the parameter identification task.

A plethora of population-based DFO algorithms exists [5,6,18,21,22]; however, our goal is to verify whether this approach, in general, can be successfully applied to the considered task. Therefore, we choose four instances of this group of methods that are easy to use and are proven to work well in practice: evolutionary strategies (ES), differential evolution (DE), estimation of distribution algorithms (EDA), and the recently proposed reversible differential evolution (RevDE). Moreover, we propose to enhance EDA and RevDE with a surrogate model to allow better exploration and speed up calculations.

**Figure 2.** An illustration of a population-based optimization of a quadratic function (blue solid line) using the Estimation of Distribution Algorithm. At each generation a population is selected (blue nodes) and weakest individuals are discarded (red crosses). New candidate solutions are generated by sampling from the normal distribution fit to the previous population (orange solid line).

#### 2.2.1. Evolutionary Strategies (ES)

Evolutionary strategies can be seen as a specialization of evolutionary algorithms with very specific choices of *G* and *S*. The core of ES is to formulate *G* using the multivariate Gaussian distribution. Here, we follow the widely used (1 + 1)-ES that generates a new candidate using Gaussian mutation parameterized by *σ* > 0, namely:

$$\mathbf{x}' = \mathbf{x} + \sigma \cdot \boldsymbol{\varepsilon},\tag{2}$$

where *ε* ∼ N(**0**, **I**), and N(**0**, **I**) denotes the Gaussian distribution with zero mean and the identity covariance matrix **I**. Next, if the fitness value of **x**′ is smaller than that of **x**, the new candidate is accepted and the old one is discarded.

The crucial element of this approach is determining the value of *σ*. To avoid a possibly time-consuming hyperparameter search, the following adaptive procedure was proposed [21]:

$$
\sigma := \begin{cases}
\sigma \cdot c & \text{if } p\_s < 1/5, \\
\sigma / c & \text{if } p\_s > 1/5, \\
\sigma & \text{if } p\_s = 1/5.
\end{cases}
\tag{3}
$$

where *p<sub>s</sub>* is the number of accepted offspring divided by the population size *N*, and *c* equals 0.817, following the recommendation in [23].
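A minimal sketch of this ES variant, assuming a sphere objective in place of the simulator: each individual produces one offspring via Gaussian mutation, Eq. (2), accepted offspring replace their parents, and *σ* is adapted with the 1/5 success rule of Eq. (3) with *c* = 0.817.

```python
import numpy as np

def f(x):
    return float(np.sum(x ** 2))           # hypothetical objective

rng = np.random.default_rng(1)
N, D, sigma, c = 10, 2, 1.0, 0.817
X = rng.uniform(-5.0, 5.0, size=(N, D))
Fx = np.array([f(x) for x in X])

for generation in range(100):
    eps = rng.standard_normal((N, D))
    C = X + sigma * eps                    # Gaussian mutation, Eq. (2)
    Fc = np.array([f(x) for x in C])
    accept = Fc < Fx                       # offspring better than its parent
    X[accept], Fx[accept] = C[accept], Fc[accept]
    p_s = accept.mean()                    # success rate
    if p_s < 1 / 5:                        # Eq. (3): shrink sigma on failure,
        sigma *= c
    elif p_s > 1 / 5:                      # grow sigma on success
        sigma /= c
```

The 1/5 rule keeps the step size roughly at the scale at which about one in five mutations succeeds, shrinking it as the population closes in on an optimum.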

#### 2.2.2. Differential Evolution (DE)

*Differential evolution* is another population-based method, loosely based on the Nelder–Mead method [24,25]. A new candidate is generated by randomly picking a triple from the population, (**x**<sub>*i*</sub>, **x**<sub>*j*</sub>, **x**<sub>*k*</sub>) ∈ X, and perturbing **x**<sub>*i*</sub> with a scaled difference between **x**<sub>*j*</sub> and **x**<sub>*k*</sub>, that is:

$$\mathbf{y} = \mathbf{x}\_{i} + F(\mathbf{x}\_{j} - \mathbf{x}\_{k}),\tag{4}$$

where *F* ∈ (0, 2] is the scaling factor. This operation could be seen as an adaptive *mutation operator* that is widely known as *differential mutation* [25].

Further, the authors of [24] proposed to sample a binary mask **m** ∈ {0, 1}<sup>*D*</sup> according to the Bernoulli distribution with probability *p* = P(*m<sub>d</sub>* = 1), shared across all *D* dimensions, and calculate the final candidate according to the following formula:

$$\mathbf{v} = \mathbf{m} \odot \mathbf{y} + (1 - \mathbf{m}) \odot \mathbf{x}\_i, \tag{5}$$

where ⊙ denotes the element-wise multiplication. In the evolutionary computation literature this operation is known as the *uniform crossover operator* [18]. In this paper, we fix *p* = 0.9 following general recommendations in the literature [26] and use the uniform crossover in all methods.

The last component of a population-based method is a selection mechanism. There are multiple variants of selection [18]; however, here we use the "survival of the fittest" approach, i.e., we combine the old population with the new one and select the *N* candidates with the best fitness values, i.e., the deterministic (*μ* + *λ*) selection.

This variant of DE is referred to as "DE/*rand*/*1*/*bin*", where *rand* stands for randomly selecting a base vector, *1* is for adding a single perturbation and *bin* denotes the uniform crossover. Sometimes it is called *classic DE* [25].
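Putting the three components together, one run of DE/*rand*/*1*/*bin* might be sketched as below; the sphere objective and the setting *F* = 0.5 are illustrative assumptions.

```python
import numpy as np

def f(x):
    return float(np.sum(x ** 2))           # hypothetical objective

rng = np.random.default_rng(2)
N, D, F, p = 20, 2, 0.5, 0.9
X = rng.uniform(-5.0, 5.0, size=(N, D))
Fx = np.array([f(x) for x in X])

for generation in range(100):
    C = np.empty_like(X)
    for n in range(N):
        i, j, k = rng.choice(N, size=3, replace=False)   # random triple
        y = X[i] + F * (X[j] - X[k])                     # differential mutation, Eq. (4)
        m = rng.random(D) < p                            # Bernoulli mask
        C[n] = np.where(m, y, X[i])                      # uniform crossover, Eq. (5)
    Fc = np.array([f(x) for x in C])
    P, Fp = np.vstack([X, C]), np.concatenate([Fx, Fc])  # (mu + lambda) selection
    best = np.argsort(Fp)[:N]
    X, Fx = P[best], Fp[best]
```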

#### 2.2.3. Reversible Differential Evolution (RevDE)

The mutation operator in DE perturbs candidates using other individuals in the population to generate a single new candidate. As a result, too small a population could limit exploration of the search space. To overcome this issue, a modification of DE was proposed that utilizes all three individuals to generate three new points in the following manner [27]:

$$\begin{aligned} \mathbf{y}\_1 &= \mathbf{x}\_i + F(\mathbf{x}\_j - \mathbf{x}\_k) \\ \mathbf{y}\_2 &= \mathbf{x}\_j + F(\mathbf{x}\_k - \mathbf{y}\_1) \\ \mathbf{y}\_3 &= \mathbf{x}\_k + F(\mathbf{y}\_1 - \mathbf{y}\_2) \end{aligned} \tag{6}$$

New candidates **y**<sub>1</sub> and **y**<sub>2</sub> are further used to calculate perturbations using points outside the population. This approach does not follow the typical construction of an EA, where only evaluated candidates are mutated. Further, we can express (6) as a linear transformation in matrix notation:

$$
\begin{bmatrix} \mathbf{y}\_1 \\ \mathbf{y}\_2 \\ \mathbf{y}\_3 \end{bmatrix} = \underbrace{\begin{bmatrix} 1 & F & -F \\ -F & 1 - F^2 & F + F^2 \\ F + F^2 & -F + F^2 + F^3 & 1 - 2F^2 - F^3 \end{bmatrix}}\_{\mathbf{R}} \begin{bmatrix} \mathbf{x}\_i \\ \mathbf{x}\_j \\ \mathbf{x}\_k \end{bmatrix}. \tag{7}
$$

In order to obtain the matrix **R**, we substitute **y**<sub>1</sub> into the second and third equations in (6), and then **y**<sub>2</sub> into the last equation in (6). As a result, we obtain *M* = 3*N* new candidate solutions. This version of DE is called *Reversible Differential Evolution*, because the linear transformation **R** is invertible [27].
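The linear map in Eq. (7) can be checked numerically. The sketch below builds **R** for an illustrative *F* = 0.5, applies it to a stacked triple, and verifies reversibility by recovering the parents from the offspring.

```python
import numpy as np

F = 0.5
# The matrix R from Eq. (7)
R = np.array([
    [1.0,      F,                  -F                   ],
    [-F,       1.0 - F**2,         F + F**2             ],
    [F + F**2, -F + F**2 + F**3,   1.0 - 2*F**2 - F**3  ],
])

x = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [2.0, 2.0]])          # stacked parents x_i, x_j, x_k (rows)
Y = R @ x                           # stacked offspring y_1, y_2, y_3

# Reversibility: R is invertible, so the parents can be recovered exactly.
x_back = np.linalg.solve(R, Y)
```

Row by row, `Y` reproduces the sequential updates in Eq. (6): for example, the first row is **x**<sub>*i*</sub> + *F*(**x**<sub>*j*</sub> − **x**<sub>*k*</sub>).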

#### 2.2.4. Estimation of Distribution Algorithms (EDA)

Most population-based optimization methods aim at finding a solution, while the information about the distribution over the search space and the fitness function is represented only implicitly by the population. However, this distribution can be modeled explicitly using a probabilistic model [19]. Such methods have become known as estimation of distribution algorithms [28–30].

The key difference between an EDA and an EA is the generation step. While an EA uses evolutionary operators such as mutation and crossover to generate new candidate solutions, an EDA fits a probabilistic model to the population and samples new individuals from this model.

Therefore, fitting a distribution to the population is the crucial part of an EDA. Various probabilistic models can be used for this purpose. Here, we propose to fit the multivariate Gaussian distribution N(**μ**, Σ) to the population X. For this purpose, we can use the empirical mean and the empirical covariance matrix:

$$
\hat{\mu} = \frac{1}{N} \sum\_{n=1}^{N} \mathbf{x}\_n, \tag{8}
$$

and

$$
\hat{\Sigma} = \frac{1}{N} \sum\_{n=1}^{N} (\mathbf{x}\_n - \hat{\mu})(\mathbf{x}\_n - \hat{\mu})^{\top}. \tag{9}
$$

An efficient manner of sampling new candidates is to first calculate the Cholesky decomposition of the covariance matrix, Σ̂ = **LL**<sup>⊤</sup>, where **L** is a lower-triangular matrix, and then compute:

$$\mathbf{x}' = \hat{\mu} + \mathbf{L}\boldsymbol{\varepsilon},\tag{10}$$

where *ε* ∼ N(**0**, **I**). Equation (10) is repeated *M* times to generate a new set of candidate solutions. Here, we set *M* to the size of the population, i.e., *M* = *N*. Once new candidate solutions are generated, the selection mechanism is applied; in this paper, we use the same selection procedure as for DE.
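One generation of the Gaussian EDA described above might be sketched as follows; the sphere objective and the small jitter added before the Cholesky factorization (to keep a nearly collapsed covariance factorizable) are illustrative assumptions.

```python
import numpy as np

def f(x):
    return float(np.sum(x ** 2))                 # hypothetical objective

rng = np.random.default_rng(3)
N, D = 50, 2
X = rng.uniform(-5.0, 5.0, size=(N, D))
Fx = np.array([f(x) for x in X])
f_init = Fx.min()                                # best fitness at initialization

for generation in range(30):
    mu = X.mean(axis=0)                          # empirical mean, Eq. (8)
    Sigma = np.cov(X, rowvar=False, bias=True)   # empirical covariance, Eq. (9)
    L = np.linalg.cholesky(Sigma + 1e-9 * np.eye(D))  # jitter for stability
    C = mu + rng.standard_normal((N, D)) @ L.T   # M = N samples, Eq. (10)
    Fc = np.array([f(x) for x in C])
    P, Fp = np.vstack([X, C]), np.concatenate([Fx, Fc])
    keep = np.argsort(Fp)[:N]                    # same selection as in DE
    X, Fx = P[keep], Fp[keep]
```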

#### 2.2.5. Population-Based Methods with Surrogate Models (RevDE+ & EDA+)

**Surrogate models:** A possible drawback of population-based methods is the necessity of evaluating large populations, which, even if a single evaluation is cheap, could significantly slow down the whole optimization process. To overcome this issue, a surrogate model can be used to partially replace querying the fitness function [31]. The surrogate model is either a probabilistic model or a machine learning model (e.g., a neural network) that accumulates previously evaluated populations and mimics the behavior of the fitness function. When applying a surrogate model, it is assumed that its utilization cost (e.g., training) is lower, or even significantly lower, than the computational cost of running the simulator.

There are multiple possible surrogate models; however, non-parametric models, e.g., Gaussian processes [20], are preferable because they do not suffer from catastrophic forgetting (i.e., overfitting to the last population and forgetting the first ones). Here, we consider another non-parametric model, namely the *K*-nearest-neighbor (*K*-NN) regression model, which stores all previously seen individuals together with their evaluations; the prediction for a new candidate solution is the average over the *K* (e.g., *K* = 3) closest previously seen individuals. Current implementations of the *K*-NN regressor provide efficient search procedures with computational complexity better than *N* · *D*, e.g., using KD-trees results in *O*(*D* log *N*). This is significantly better than the computational complexity of Gaussian processes, *O*(*N*<sup>3</sup>).

**RevDE+**: The RevDE approach generates 3*N* new candidate solutions, all of which are further evaluated. However, this introduces an extra computational cost of running the simulator. This issue can be alleviated by using the *K*-NN regressor to approximate the fitness values of the new candidates and then selecting the *N* most promising points. We refer to this approach as RevDE+.

**EDA+**: The outlined EDA procedure produces *M* new candidate solutions, and to keep the computational cost similar to that of ES and DE, we set *M* = *N*. However, this could significantly limit the potential of modeling the search space, because sampling in high-dimensional search spaces requires a significantly larger number of points. A potential solution to this problem is to apply the *K*-NN regressor to quickly screen *L* new points. As long as the time cost of approximating the fitness function is lower than the running time of the simulator, we can afford to take *L* > *N* (e.g., *L* = 5*N*). We refer to this approach as EDA+.
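The surrogate-based screening shared by RevDE+ and EDA+ can be sketched as below. A brute-force *K*-NN regressor stands in for a KD-tree-backed implementation, and the sphere objective plays the role of the expensive simulator; all names and settings are illustrative.

```python
import numpy as np

def f(x):
    return float(np.sum(x ** 2))           # hypothetical expensive simulator

def knn_predict(X_seen, F_seen, Q, K=3):
    """Predict fitness of queries Q as the mean over the K nearest stored points."""
    d = np.linalg.norm(Q[:, None, :] - X_seen[None, :, :], axis=-1)
    idx = np.argsort(d, axis=1)[:, :K]
    return F_seen[idx].mean(axis=1)

rng = np.random.default_rng(4)
N, D = 10, 2
X_seen = rng.uniform(-5.0, 5.0, size=(100, D))   # archive of evaluated individuals
F_seen = np.array([f(x) for x in X_seen])

C = rng.uniform(-5.0, 5.0, size=(5 * N, D))      # L = 5N candidates
F_hat = knn_predict(X_seen, F_seen, C)           # cheap surrogate screening
promising = C[np.argsort(F_hat)[:N]]             # keep the N most promising
F_true = np.array([f(x) for x in promising])     # only N simulator calls
```

Only the *N* screened points reach the simulator, so the per-generation simulation budget stays the same while *L* > *N* points are explored.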

#### *2.3. The Model of Glycolysis in Saccharomyces Cerevisiae*

**Introduction**: As an example, we chose *glycolysis*, a crucial metabolic pathway whose upregulation is correlated with diseases like cancer [9,10]. Nearly all living organisms carry out glycolysis as part of cellular metabolism. The glycolytic pathway consists of a series of reactions that break down glucose into two three-carbon compounds and extract energy for cellular metabolism. Therefore, glycolysis is at the heart of classical biochemistry and, as such, it is very well described. One of the most intensively studied organisms in the context of, among others, glycolysis is the *Saccharomyces cerevisiae* species, also known as baker's yeast [11–15]. Moreover, the dynamic model of glycolysis in *Saccharomyces cerevisiae* is of great interest in the systems biology literature on dynamic modeling [16,32–35].

**Figure 3.** The glycolysis process in the yeast *Saccharomyces cerevisiae* proposed in [16]. There are 11 reactions governing the process, with 18 parameters in total, and 9 metabolites. Blue circles depict observable metabolites, red circles denote unobservable metabolites, and green squares represent reactions. A white circle with a diagonal line corresponds to a sink. The model is taken from the JWS database [8].

We applied our optimization framework to a model of glycolysis in yeast proposed in [16] (see Figure 3) that suffices to present the essence of our framework. This model contains lumped reactions of the glycolytic pathway and includes the production of glycerol, fermentation to ethanol, the exchange of acetaldehyde between the cells, and the trapping of acetaldehyde by cyanide.

**A system of Ordinary Differential Equations**: In the considered model of glycolysis, we distinguish the following metabolites: glucose (*glu*), fructose-1,6-bisphosphate (*fru*), triose phosphates (*triop*), 3-phosphoglycerate (*tp*), pyruvate (*pyr*), acetaldehyde (*ac*), external acetaldehyde (*ace*), as well as ATP (*atp*) and NADH (*nad*).

Following the same assumptions as in [16] (i.e., a homogeneous distribution of the metabolites in the intracellular and in the extracellular solution), the system of ordinary differential equations of the glycolysis model in *Saccharomyces cerevisiae* is the following [36]:

$$\dot{glu} = v\_1 - v\_2 \tag{11}$$

$$\dot{fru} = v\_2 - v\_3 \tag{12}$$

$$\dot{triop} = 2v\_3 - v\_4 - v\_9 \tag{13}$$

$$\dot{tp} = v\_4 - v\_5 \tag{14}$$

$$\dot{pyr} = v\_5 - v\_6 \tag{15}$$

$$\dot{ac} = v\_6 - v\_7 - v\_8 \tag{16}$$

$$\dot{ace} = 0.1 v\_8 - v\_{10} \tag{17}$$

$$\dot{atp} = -2v\_2 + v\_4 + v\_5 - v\_{11} \tag{18}$$

$$\dot{nad} = v\_4 - v\_7 - v\_9 \tag{19}$$

with the rate equations:

$$v\_1 = k\_0 \tag{20}$$

$$v\_2 = \frac{k\_1 \cdot glu \cdot atp}{1 + (atp/k\_i)^n} \tag{21}$$

$$v\_3 = k\_2 \cdot fru \tag{22}$$

$$v\_4 = \frac{k\_{31} \cdot k\_{32} \cdot triop \cdot nad \cdot A - k\_{33} \cdot k\_{34} \cdot tp \cdot atp \cdot N}{k\_{33} \cdot N + k\_{32} \cdot A} \tag{23}$$

$$v\_5 = k\_4 \cdot tp \cdot A \tag{24}$$

$$v\_6 = k\_5 \cdot pyr \tag{25}$$

$$
v\_7 = k\_6 \cdot ac \cdot nad \tag{26}$$

$$v\_8 = \kappa \cdot (ac - ace) \tag{27}$$

$$v\_9 = k\_8 \cdot triop \cdot nad \tag{28}$$

$$v\_{10} = k\_9 \cdot ace \tag{29}$$

$$v\_{11} = k\_7 \cdot atp \tag{30}$$

where *A* = (*atot* − *atp*) and *N* = (*ntot* − *nad*).
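For illustration, a few of the rate laws above (the kinetic expressions tagged (21), (23), and (24)) can be evaluated directly. All numerical values below (rate constants, concentrations, and totals) are hypothetical placeholders, since identifying the true parameter values is precisely the task considered in this paper.

```python
# Hypothetical rate constants (placeholders, not identified values)
k1 = 550.0
k31, k32, k33, k34 = 323.8, 76411.1, 57823.1, 23.7
k4, ki, n = 80.0, 1.0, 4.0
atot, ntot = 4.0, 1.0

# Hypothetical metabolite concentrations at some time point
glu, triop, tp, atp, nad = 1.0, 0.3, 0.2, 2.0, 0.6

A = atot - atp                 # A = (atot - atp)
N = ntot - nad                 # N = (ntot - nad)

# Rate law tagged (21): substrate-inhibited hexokinase/PFK lumped step
v2 = k1 * glu * atp / (1.0 + (atp / ki) ** n)

# Rate law tagged (23): reversible lumped GAPDH/PGK step
v4 = (k31 * k32 * triop * nad * A - k33 * k34 * tp * atp * N) \
     / (k33 * N + k32 * A)

# Rate law tagged (24): mass-action step proportional to tp and A
v5 = k4 * tp * A
```

In the identification task, expressions of this form are evaluated inside the ODE right-hand side at every integration step, so their low cost is what makes population-based optimization affordable here.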
