*Proceeding Paper* **Surrogate-Enhanced Parameter Inference for Function-Valued Models †**

**Christopher G. Albert <sup>1,2,\*</sup>, Ulrich Callies <sup>3</sup> and Udo von Toussaint <sup>1</sup>**


**Abstract:** We present an approach to enhance the performance and flexibility of the Bayesian inference of model parameters from observed data. Going beyond the usual surrogate-enhanced Monte-Carlo or optimization methods that focus on a scalar loss, we place emphasis on a function-valued output of formally infinite dimension. For this purpose, the surrogate models are built on a combination of linear dimensionality reduction in an adaptive basis of principal components and Gaussian process regression for the map between the reduced feature spaces. Since the decoded surrogate provides the full model output rather than only the loss, it is re-usable for multiple calibration measurements as well as different loss metrics and, consequently, allows for flexible marginalization over such quantities and applications to Bayesian hierarchical models. We evaluate the method's performance based on a case study of a toy model and a simple riverine diatom model for the Elbe river. As input data, this model uses six tunable scalar parameters as well as silica concentrations in the upper reach of the river, together with continuous time series of temperature, radiation, and river discharge over a specific year. The output consists of continuous time-series data that are calibrated against corresponding measurements from the Geesthacht Weir station at the Elbe river. For this study, only two scalar inputs were considered together with a function-valued output, and the results were compared to an existing model calibration using direct simulation runs without a surrogate.

**Keywords:** parameter inference; Monte Carlo; surrogate model; Gaussian process regression; dimensionality reduction

#### **1. Introduction**

Delayed acceptance [1,2] can accelerate Markov chain Monte Carlo (MCMC) sampling by up to a factor of one over the acceptance rate. To do so, it requires a surrogate of the posterior, which, in the case of model calibration, contains the cost function inside the likelihood. The simplest implementation of delayed acceptance relies on a surrogate with scalar output built for this cost function or for the likelihood. Here, we take an intermediate step and construct a surrogate for the functional output of a blackbox model to be calibrated against reference data. Typical examples are numerical simulations that output time-series or spatial data and depend on tunable input parameters.

There exist numerous related works treating blackbox models with functional outputs via surrogates. Campbell et al. [3] used an adaptive basis of principal component analysis (PCA) to perform global sensitivity analysis. Pratola et al. [4] and Ranjan et al. [5] used Gaussian process (GP) regression for sequential model calibration in a Bayesian framework. Lebel et al. [6] modeled the likelihood function in an MCMC model calibration via a Gaussian process. Perrin [7] compared the use of a multi-output GP surrogate with a Kronecker structure to an adaptive basis approach.

The present contribution relies on the adaptive basis approach in principal components (Karhunen–Loève expansion or functional PCA) to reduce the dimensions of the functional

**Citation:** Albert, C.G.; Callies, U.; von Toussaint, U. Surrogate-Enhanced Parameter Inference for Function-Valued Models. *Phys. Sci. Forum* **2021**, *3*, 11. https://doi.org/10.3390/psf2021003011

**Academic Editors:** Wolfgang von der Linden and Sascha Ranftl

**Published:** 21 December 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

output, while modeling the map from inputs to weights in this basis via GP regression. We demonstrate the application of this approach on two examples using usual and hierarchical Bayesian model calibration. In the latter case, a surrogate beyond the *L*<sup>2</sup> cost function is required if the likelihood depends on additional auxiliary parameters. As an example, we allow variations of the (fractional) order of the norm, thereby marginalizing over different noise models, including Gaussian and Laplacian noise.

#### **2. Gaussian Process Regression and Bayesian Global Optimization**

Gaussian process regression [8–10] is a commonly used tool to construct flexible non-parametric surrogates. Based on the observed outputs *f*(*x<sub>k</sub>*) at training points *x<sub>k</sub>* and a covariance function *k*(*x*, *x*′), the GP regressor predicts a Gaussian posterior distribution at any point *x*<sup>∗</sup>. For a single prediction *f*(*x*<sup>∗</sup>), the expected value and variance of this distribution are given by

$$\bar{f}(\mathbf{x}^*) = m(\mathbf{x}^*) + K^* (K + \sigma_n I)^{-1} \mathbf{y}, \tag{1}$$

$$\operatorname{var}[f(\mathbf{x}^*)] = K^{**} - K^* (K + \sigma_n I)^{-1} K^{*T}, \tag{2}$$

where *m*(*x*<sup>∗</sup>) is the mean model, the covariance matrix *K* contains entries *K<sub>ij</sub>* = *k*(*x<sub>i</sub>*, *x<sub>j</sub>*) based on the training set, *K*<sup>∗</sup><sub>*i*</sub> = *k*(*x*<sup>∗</sup>, *x<sub>i</sub>*) are the entries of a row vector, and *K*<sup>∗∗</sup> = *k*(*x*<sup>∗</sup>, *x*<sup>∗</sup>) is a scalar. The unit matrix *I*, scaled by the noise covariance *σ<sub>n</sub>*, is added to *K*; this regularizes the problem, and *σ<sub>n</sub>* is usually estimated in an optimization loop together with the other kernel hyperparameters.
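Equations (1) and (2) can be sketched numerically in a few lines. The snippet below is an illustrative minimal version, assuming a squared-exponential kernel and a mean model whose residuals enter the training targets (both our choices for the sketch, not the setup used later in the paper):

```python
import numpy as np

def sqexp(a, b, length=1.0):
    """Squared-exponential covariance; an illustrative stand-in kernel."""
    return np.exp(-0.5 * (np.subtract.outer(a, b) / length) ** 2)

def gp_predict(x_star, X, y, m, sigma_n, kernel=sqexp):
    """GP posterior mean (1) and variance (2) at a single point x_star.
    The training targets enter as residuals y - m(X) of the mean model."""
    K = kernel(X, X)                                  # K_ij = k(x_i, x_j)
    K_star = kernel(np.atleast_1d(x_star), X)         # row vector k(x*, x_i)
    K_ss = kernel(np.atleast_1d(x_star), np.atleast_1d(x_star))
    A = K + sigma_n * np.eye(len(X))                  # regularized covariance
    mean = m(x_star) + K_star @ np.linalg.solve(A, y - m(X))
    var = K_ss - K_star @ np.linalg.solve(A, K_star.T)
    return float(mean), float(var)
```

With small *σ<sub>n</sub>* the predictor interpolates the training data, and far from all training points the variance reverts to the prior variance *k*(*x*<sup>∗</sup>, *x*<sup>∗</sup>).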

Such a surrogate with uncertainty information can be used for Bayesian global optimization [11–13] of the log-posterior as a cost function. Here, we apply this method to reach the vicinity of the posterior's mode before sampling. As an acquisition function, we use the expected improvement (see, e.g., [12]) at a newly observed location *x*<sup>∗</sup>, given existing training data D,

$$\begin{split} a_{\mathrm{EI}}(\mathbf{x}^*) &= E[\max(0, f(\mathbf{x}^*) - \hat{f}) \,|\, \mathbf{x}^*, \mathcal{D}] \\ &= (\bar{f}(\mathbf{x}^*) - \hat{f})\, \Phi(\hat{f}; \bar{f}(\mathbf{x}^*), \operatorname{var}[f(\mathbf{x}^*)]) + \operatorname{var}[f(\mathbf{x}^*)]\, \mathcal{N}(\hat{f}; \bar{f}(\mathbf{x}^*), \operatorname{var}[f(\mathbf{x}^*)]), \end{split} \tag{3}$$

where *f̂* is the optimum value of *f*(*x*) observed thus far. Due to the non-linear transformation from the functional blackbox output to the value of the cost function, it is more convenient to realize Bayesian optimization with a direct GP surrogate of the cost function, constructed in addition to the surrogate for the KL expansion coefficients of the functional output described below.
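For reference, the expected improvement (3) has a closed form; the sketch below evaluates it in the standard mean/standard-deviation parametrization for a cost that is being *minimized* (a convention choice on our part, equivalent up to the sign of the improvement):

```python
import numpy as np
from math import erf, exp, pi, sqrt

def expected_improvement(mu, var, f_best):
    """Closed-form expected improvement for minimizing a cost:
    EI = (f_best - mu) * Phi(z) + sd * phi(z), with z = (f_best - mu)/sd."""
    sd = sqrt(max(var, 1e-12))
    z = (f_best - mu) / sd
    Phi = 0.5 * (1.0 + erf(z / sqrt(2.0)))       # standard normal CDF
    phi = exp(-0.5 * z * z) / sqrt(2.0 * pi)     # standard normal pdf
    return (f_best - mu) * Phi + sd * phi
```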

#### **3. Delayed Acceptance MCMC**

Delayed acceptance MCMC builds on a fast surrogate for the posterior *p̃*(*x*|*y*) to reject unlikely proposals early [1,2]. Following the usual Metropolis–Hastings algorithm, the probability to accept a new proposal *x*<sup>∗</sup> in this first stage of the *n*-th step of the Markov chain is, as usual,

$$\tilde{P}_{\text{acc}} = \frac{\tilde{p}(\mathbf{x}^*|\mathbf{y})}{\tilde{p}(\mathbf{x}_{n-1}|\mathbf{y})} \frac{g(\mathbf{x}_{n-1}|\mathbf{x}^*)}{g(\mathbf{x}^*|\mathbf{x}_{n-1})}, \tag{4}$$

where *g* is a transition probability that has been suitably tuned during warmup. The true posterior *p*(*x*|*y*) is only evaluated if the proposal "survives" this first stage and enters the final acceptance probability

$$P_{\text{acc}}^{\text{tr}} = \frac{p(\mathbf{x}^*|\mathbf{y})}{p(\mathbf{x}_{n-1}|\mathbf{y})} \frac{\tilde{p}(\mathbf{x}_{n-1}|\mathbf{y})}{\tilde{p}(\mathbf{x}^*|\mathbf{y})}. \tag{5}$$

Actual computation is typically performed in the logarithmic space with a cost function

$$\ell(\mathbf{x}|\mathbf{y}) \equiv -\log p(\mathbf{x}|\mathbf{y}). \tag{6}$$

If this function is fixed, it is most convenient to directly build a surrogate *ℓ̃*(*x*|*y*) for the log-posterior *ℓ*(*x*|*y*), including the corresponding prior.
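A single delayed-acceptance step combining (4) and (5) can be sketched as follows, assuming a symmetric proposal so that the ratio of transition probabilities *g* in (4) cancels (function names are ours):

```python
import numpy as np

def da_step(x, logpost, logpost_surr, propose, rng):
    """One delayed-acceptance Metropolis step with a symmetric proposal:
    the cheap surrogate screens proposals before the full posterior is used."""
    x_new = propose(x, rng)
    d_surr = logpost_surr(x_new) - logpost_surr(x)
    # Stage 1: surrogate acceptance test, Eq. (4)
    if np.log(rng.uniform()) >= d_surr:
        return x                     # early rejection, no full model call
    # Stage 2: correction with the true posterior, Eq. (5)
    if np.log(rng.uniform()) < (logpost(x_new) - logpost(x)) - d_surr:
        return x_new
    return x
```

Because stage 2 divides out the surrogate ratio again, the chain targets the exact posterior regardless of surrogate bias; a poor surrogate only costs efficiency.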

#### **4. Bayesian Hierarchical Models and Fractional Norms**

One application of modeling the full functional output instead of only the cost function arises from additional distribution parameters *θ* appearing in the likelihood besides the original model inputs *x*. Such dependencies appear within Bayesian hierarchical models [14], where *θ* are again subject to a certain (prior) distribution with possibly further levels of hyperparameters. There are essentially two ways to construct a surrogate with support for additional parameters *θ*: building a surrogate for the cost function that adds *θ* as independent variables, or constructing a surrogate with functional output for *f<sub>k</sub>*(*x*) and keeping the dependencies on *θ* exact. Here, we focus on the latter and apply this surrogate within delayed acceptance MCMC with both *x* and *θ* as tunable parameters.

As an example, we use a more general noise model than the usual Gaussian likelihood, built on *L<sup>θ</sup>* norms [15–17] with real-valued *θ* that is not fixed while traversing the Markov chain. We allow members of the exponential family for the observational noise and specify only its scale, keeping *θ* as a free parameter. Namely, we model the likelihood for observing *y* in the output as

$$p(y|\mathbf{x}, \theta) = \frac{1}{2\sqrt{2}\,\sigma\, \Gamma(1 + \theta^{-1})}\, e^{-\ell(y;\, \mathbf{x}, \theta)}, \tag{7}$$

with the normalized *L<sup>θ</sup>* norm to the power of *θ*,

$$\ell(y; \mathbf{x}, \theta) \equiv \frac{1}{D} \sum_{i=1}^{D} \left| \frac{y_i - f_i(\mathbf{x})}{\sqrt{2}\,\sigma} \right|^{\theta} \tag{8}$$

as the loss function between the observed data *y<sub>i</sub>* and the blackbox model *f<sub>i</sub>*(*x*). Choosing the usual *L*<sup>2</sup> norm leads to a Gaussian likelihood for the noise model, whereas the *L*<sup>1</sup> norm yields Laplacian noise. To maintain the relative scale when varying *θ*, it is important to add the term log Γ(1 + *θ*<sup>−1</sup>) from (7) to the negative log-likelihood. In the following use cases, we compare fixed and variable *θ*.
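The negative log-likelihood corresponding to (7) and (8), including the *θ*-dependent normalization, is a direct transcription (the constant log 2√2*σ* is kept for completeness):

```python
import numpy as np
from math import lgamma, log, sqrt

def neg_log_likelihood(y, f, sigma, theta):
    """-log p(y|x, theta) per Eq. (7) with the normalized L^theta loss (8)."""
    loss = np.mean(np.abs((y - f) / (sqrt(2.0) * sigma)) ** theta)  # Eq. (8)
    return loss + lgamma(1.0 + 1.0 / theta) + log(2.0 * sqrt(2.0) * sigma)
```

For *θ* = 2 and *σ* = 1 the normalization reduces to the familiar Gaussian constant ½ log 2π per data point, since 2√2 Γ(3/2) = √(2π); for *θ* = 1 one recovers the Laplacian normalization.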

#### **5. Linear Dimension Reduction via Principal Components**

Formally, the blackbox output for a given input *x* can be a function *f*(*t*) ∈ H in an infinite-dimensional Hilbert space (though sampled at a finite number of points in practice). Linear dimension reduction in such a space means finding the optimal set of basis functions *ϕ<sub>k</sub>*(*t*) that spans the output space of *f*(*t*; *x*) for any input *x* given to the blackbox. The reduced model of order *r* is then given by

$$f(t; \mathbf{x}) \approx \sum_{k=1}^{r} z_k(\mathbf{x})\, \phi_k(t). \tag{9}$$

This approach is known as the Karhunen–Loève (KL) expansion [18] in case the *f*(*t*; *x*) are interpreted as realizations of a random process, or as functional principal component analysis (FPCA) [19]. For our application, this distinction does not matter. The KL expansion boils down to solving a regression problem in the non-orthogonal basis of *N* observed realizations to represent new observations. Then, an eigenvalue problem is solved to invert the *N* × *N* collocation matrix with entries

$$M_{ij} = \langle f(t; \mathbf{x}_i), f(t; \mathbf{x}_j) \rangle. \tag{10}$$

Here, the inner product in the Hilbert space and its approximation on a finite set of support points are given by

$$\langle u, v \rangle = \int_{\Omega} u(t)\, v(t)\, \mathrm{d}t \approx \frac{1}{N_t} \sum_{k=1}^{N_t} u(t_k)\, v(t_k). \tag{11}$$

If *N<sub>t</sub>* ≫ *N* (many support points, few samples), solving the eigenvalue problem of the collocation matrix *M* is more efficient than the dual one of the covariance matrix *C* with *C<sub>ij</sub>* = ∑<sub>*k*</sub> *f*(*t<sub>i</sub>*, *x<sub>k</sub>*) *f*(*t<sub>j</sub>*, *x<sub>k</sub>*) in the usual PCA (see [9] for their equivalence via the singular value decomposition of *Y<sub>ij</sub>* = *f*(*t<sub>i</sub>*, *x<sub>j</sub>*)). The question of at which *r* to truncate the eigenspectrum in (9) depends on the desired accuracy in the output, which is briefly analyzed in the following paragraph.
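The construction via the collocation matrix can be sketched in a few lines (names and normalization conventions are ours; the resulting basis is orthonormal with respect to the discrete inner product (11)):

```python
import numpy as np

def kl_basis(Y, r):
    """KL/FPCA basis of rank r from N snapshots Y[i, j] = f(t_j; x_i):
    eigendecompose the N x N collocation matrix (10) instead of the
    N_t x N_t covariance matrix."""
    N, N_t = Y.shape
    M = Y @ Y.T / N_t                # M_ij = <f_i, f_j> via quadrature (11)
    lam, V = np.linalg.eigh(M)
    idx = np.argsort(lam)[::-1][:r]  # keep the r leading eigenpairs
    lam, V = lam[idx], V[:, idx]
    Phi = Y.T @ V / np.sqrt(lam)     # basis functions phi_k(t_j)
    Z = Y @ Phi / N_t                # weights z_k(x_i) for Eq. (9)
    return Phi, Z, lam
```

For snapshots of rank at most *r*, `Z @ Phi.T` reconstructs `Y` exactly; otherwise the truncation error is governed by the discarded eigenvalues as analyzed below.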

#### *Error Estimate*

Here, we justify the assumption that the relative *L*<sup>2</sup> truncation error is of the order of the ratio *λ<sub>r</sub>*/*λ*<sub>1</sub> between the smallest eigenvalue considered in the approximation and the largest one. The truncated SVD can be shown to be the best approximation *M*<sup>(*r*)</sup> of lower rank *r* to an *N* × *N* matrix *M* in terms of the Frobenius norm ||*M*||<sub>F</sub> (see, e.g., [20]). Its value is simply computed from the *L*<sup>2</sup> norm of the singular values,

$$||M||_{\mathrm{F}} = \left( \sum_{k=1}^{N} \sigma_k^2 \right)^{1/2}, \tag{12}$$

where *σ*<sub>*k*</sub><sup>2</sup> = *λ<sub>k</sub>* in the case of real eigenvalues *λ<sub>k</sub>* of a positive semi-definite matrix, as for the covariance or collocation matrix. The truncation error is given by

$$||M^{(r)} - M||_{\mathrm{F}} = \left( \sum_{k=r+1}^{N} \lambda_k \right)^{1/2}. \tag{13}$$

The error estimate for the KL expansion uses this convenient property together with the fact that the Frobenius norm is compatible with the usual *L*<sup>2</sup> norm |*y*| of vectors *y*, i.e.,

$$|M\mathbf{y}| \le ||M||_{\mathrm{F}}\, |\mathbf{y}|. \tag{14}$$

Representing *y* via the first *r* eigenvalues of the collocation matrix yields a relative squared reconstruction error of

$$|(M^{(r)} - M)\mathbf{y}|^2 / |\mathbf{y}|^2 \le \sum_{k=r+1}^{N} \lambda_k \le (N - r)\, \lambda_r. \tag{15}$$

The last estimate is rather crude if *N* ≫ *r* and the spectrum decays fast with the index *k*. If one instead assumes a decay rate *α* with

$$\lambda_k \approx \lambda_r (k - r)^{-\alpha}, \tag{16}$$

one obtains

$$\sum_{k=r+1}^{N} \lambda_k \approx \sum_{k=r+1}^{\infty} \lambda_r (k - r)^{-\alpha} = \lambda_r \sum_{k=1}^{\infty} k^{-\alpha} = \lambda_r\, \zeta(\alpha), \tag{17}$$

where *ζ* is the Riemann zeta function. This sum diverges for a spectral decay of order *α* = 1 and approaches its asymptotic value *ζ*(∞) = 1 relatively quickly for *α* ≥ 2 (e.g., *ζ*(3) ≈ 1.2). The spectral decay rate *α* can be fitted in a log–log plot of *λ<sub>k</sub>* over the index *k* and takes values between *α* = 3 and 5 in our use case. The underlying assumptions are violated if the spectrum stagnates at a large number of nearly constant eigenvalues for higher indices *k*.
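The tail estimate (17) is easy to verify numerically; the following snippet checks it for a spectrum decaying with *α* = 3 beyond the truncation index (illustrative values):

```python
import numpy as np

# For lam_k = lam_r * (k - r)**(-alpha) beyond index r, the discarded tail
# in Eq. (17) sums to lam_r * zeta(alpha).
alpha, r, lam_r = 3.0, 5, 1e-4
k = np.arange(r + 1, 200001)
tail = lam_r * np.sum((k - r) ** (-alpha))
zeta_3 = 1.2020569032            # Riemann zeta(3)
print(tail / lam_r, zeta_3)      # the two agree to many digits
```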

#### **6. Implementation and Results**

The idea behind the realization of MCMC with a function-valued surrogate is quite simple. Instead of directly using a surrogate for the cost with fixed *θ*, we take a step in-between. Multiple surrogates *z̃<sub>k</sub>*(*x*) are built, where each maps the input *x* to one weight *z<sub>k</sub>*(*x*) in the KL expansion. A surrogate *f̃<sub>i</sub>*(*x*) ≡ *f̃*(*t<sub>i</sub>*; *x*) for the model output is then given by replacing *z<sub>k</sub>*(*x*) by *z̃<sub>k</sub>*(*x*) in (9). The corresponding surrogate *ℓ̃*(*y*; *x*, *θ*) for the cost function uses *f̃<sub>i</sub>*(*x*) instead of *f<sub>i</sub>*(*x*) in (8). Dependencies on *θ* are kept exact in this approach. The main algorithm proceeds in the following steps:

1. Run the full model on an initial set of training inputs and store the functional outputs.
2. Compute the KL basis from these outputs and fit GP surrogates *z̃<sub>k</sub>*(*x*) for the expansion weights, as well as a direct GP surrogate of the cost for optimization.
3. Apply Bayesian global optimization to reach the vicinity of the posterior's mode, adding training runs where required.
4. Sample the posterior via delayed acceptance MCMC, using the surrogate cost in the first stage of each step.
For all GP surrogates, we use a Matérn 5/2 kernel for *k*(*x*, *x*′) together with a linear mean model for *m*(*x*), as realized in the Python package GPy [21]. For step 4, we use Gibbs sampling and the surrogate for *z*(*x*), yielding the full output *y*(*t*, *x*) rather than only the *L*<sup>2</sup> distance to a certain reference dataset. The idea of refining the surrogate iteratively during MCMC had to be abandoned early: detailed balance is violated as soon as the surrogate proposal probabilities change when the GP regressor is modified with a new point. In the following application cases, we compare a usual MCMC evaluation using the full model to MCMC with delayed acceptance using the GP surrogate together with the KL expansion/functional PCA (GP+KL) in the output function space.
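Decoding the surrogate output from the GP-predicted weights amounts to one matrix product. The sketch below uses a plain squared-exponential GP posterior mean in place of GPy's Matérn 5/2 regressors (an illustrative simplification with our own function names):

```python
import numpy as np

def surrogate_output(x_star, X, Z, Phi, length=0.5, sigma_n=1e-8):
    """Surrogate f~(t; x*) = sum_k z~_k(x*) phi_k(t): the GP posterior means
    of all KL weights at x_star, decoded through the basis per Eq. (9)."""
    def kern(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * d2 / length ** 2)
    K = kern(X, X) + sigma_n * np.eye(len(X))
    K_star = kern(x_star[None, :], X)              # 1 x N row vector
    z_tilde = K_star @ np.linalg.solve(K, Z)       # predicted weights, 1 x r
    return (z_tilde @ Phi.T).ravel()               # decoded time series f~(t_i)
```

Since all weight GPs here share one kernel, their predictions collapse into a single solve; with individual hyperparameters per component, as used in the paper, one solve per GP is needed instead.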

#### *6.1. Toy Model*

First, we test the quality of the algorithm on a toy model given by

$$y(t, \mathbf{x}) = x_1 \sin((t - x_2)^3). \tag{18}$$

We choose reference values *x*<sub>1</sub> = 1.15, *x*<sub>2</sub> = 1.4 to test the calibration of *x* against the corresponding output *y*<sub>ref</sub>(*t*) ≡ *y*(*t*, *x*<sub>ref</sub>) and add Gaussian noise of amplitude *σ* = 0.05. A flat prior is used for *x*. For the hierarchical model case (7), we choose a starting guess of *θ* = 2 for the norm's order and a Gaussian prior with *σ<sub>θ</sub>* = 0.5 around this value, together with a positivity constraint. The initial sampling domain is the square *x*<sub>1</sub>, *x*<sub>2</sub> ∈ (0, 2). The comparison between MCMC and delayed acceptance MCMC is made once for fixed *θ* = 2 (Gaussian likelihood) and then for a hierarchical model with a random walk also in *θ*. The respective Markov chains with 10,000 steps have a correlation length of ≈10 steps (Figure 1) and yield a posterior parameter distribution for (*x*<sub>1</sub>, *x*<sub>2</sub>) depicted in Figure 2.
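Reference data for this experiment can be generated in a few lines (the time grid is our choice; the paper does not state it):

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.linspace(0.0, 2.0, 100)            # time grid (our assumption)

def model(t, x1, x2):
    return x1 * np.sin((t - x2) ** 3)     # toy model, Eq. (18)

# Reference output with additive Gaussian noise of amplitude sigma = 0.05
y_ref = model(t, 1.15, 1.4) + 0.05 * rng.normal(size=t.size)
```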

The results in Figure 2 show good agreement in the posterior distributions of full MCMC and delayed acceptance MCMC. Compared to the case with fixed *θ* = 2, the additional freedom in *θ* in the hierarchical model leads to further exploration of the parameter space. The posterior of *θ* according to the Markov chain is given in Figure 3. The similarity to the prior distribution shows that the data does not yield new information on how to choose *θ*.

**Figure 1.** Autocorrelation over lag in MCMC steps for inputs *x*<sup>1</sup> (solid) and *x*<sup>2</sup> (dashed) in the toy model. **Top**: Gaussian likelihood, and **bottom**: hierarchical model. **Left**: full MCMC, and **right**: delayed acceptance MCMC with GP+KL surrogate.

**Figure 2.** Posterior distribution of the calibrated parameters *x* in (18). **Top**: Gaussian likelihood, **bottom**: hierarchical model. **Left**: full MCMC, **right**: delayed acceptance MCMC with GP+KL surrogate.

**Figure 3.** Posterior distribution of the fractional order *θ* in the loss function with *<sup>θ</sup>* norm. **Left**: full MCMC, **right**: delayed acceptance MCMC with GP+KL surrogate.

#### *6.2. Riverine Diatom Model*

The final application of the described method is on a riverine diatom model [22,23]. This model predicts the chlorophyll *a* concentration at an observation point at the Elbe river as a time series and depends on several input parameters. For simplicity, and to limit computational resources, we select only two of the six scalar inputs and use fixed values for the remaining four. Namely, the chosen parameters *x*<sup>1</sup> = *K*light and *x*<sup>2</sup> = *μ*<sup>0</sup> appear in the growth rate inside the diatom model. The latter is given by the "Smith formula" [24] for photosynthesis,

$$\mu(t) \approx \mu_0 \frac{1}{D} \int_0^D \frac{I(t)\, e^{-\lambda(t) z}}{\sqrt{K_{\text{light}}^2 + I^2(t)\, e^{-2\lambda(t) z}}}\, \mathrm{d}z,$$

where *D* is the water depth and *I*(*t*) is the radiation intensity prescribed at the water surface. Light attenuation *λ*(*t*) ≡ *λ<sub>S</sub>C*<sub>chl</sub>(*t*) is modeled to be proportional to the chlorophyll *a* concentration *C*<sub>chl</sub>(*t*). Equations are solved within a Lagrangian setup, following water parcels that travel down the Elbe river. Data points of the local chlorophyll time series simulated at Geesthacht Weir are given by the chlorophyll *a* values at the Lagrangian trajectory end points. These values are the functional model output *y*(*t*), for which the model is calibrated with respect to measurements *y*<sub>ref</sub>(*t*). As the parameters are positive and limited by reasonable maximum values from domain knowledge, we use a half-sided Cauchy (Lorentz) prior

$$p(x_k) = \frac{2}{\pi} \frac{b_k}{b_k^2 + x_k^2} \ \text{ for } x_k > 0, \qquad p(x_k) = 0 \ \text{ for } x_k \le 0. \tag{19}$$

Here, we choose a scale value *x*<sub>*k*</sub><sup>∗</sup> for which *P*<sup>∗</sup> = 90% of the probability volume is contained within *x<sub>k</sub>* < *x*<sub>*k*</sub><sup>∗</sup>. Considering the cumulative distribution, we have to set

$$b_k = \frac{x_k^*}{\tan\left(\frac{\pi}{2} P^*\right)} \tag{20}$$

to realize this condition.
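Conditions (19) and (20) translate directly into code (function names are ours):

```python
import numpy as np

def half_cauchy_scale(x_q, P=0.9):
    """Scale b_k such that a fraction P of the half-Cauchy mass (19)
    lies below x_q, per Eq. (20)."""
    return x_q / np.tan(0.5 * np.pi * P)

def log_prior(x, b):
    """Log density of the half-sided Cauchy prior (19); -inf for x <= 0."""
    x = np.asarray(x, dtype=float)
    dens = np.log(2.0 / np.pi) + np.log(b) - np.log(b ** 2 + x ** 2)
    return np.where(x > 0, dens, -np.inf)
```

The check follows from the half-Cauchy cumulative distribution *F*(*x*) = (2/π) arctan(*x*/*b<sub>k</sub>*): setting *F*(*x*<sub>*k*</sub><sup>∗</sup>) = *P*<sup>∗</sup> and solving for *b<sub>k</sub>* yields exactly (20).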

As in the case of the toy model, we use 10,000 steps in the Markov chain. The results for autocorrelation and posterior samples using the full model versus delayed acceptance are shown in Figures 4 and 5. The correlation time of ≈500 steps is much larger than in the toy model, and the decay of the autocorrelation over the lag roughly matches between the two approaches. Delayed acceptance sampling produces similar posterior samples in Figure 5 at about one third of the overall computation time. There, one also sees the issue of high correlation between *K*light and *μ*<sup>0</sup> in the posterior of the calibration, making Gibbs sampling inefficient in that particular case.

**Figure 4.** Autocorrelation over lag in MCMC steps for inputs *K*light (solid) and *μ*<sup>0</sup> (dashed) in the riverine diatom model. **Top**: Gaussian likelihood, **bottom**: hierarchical model. **Left**: full MCMC, **right**: delayed acceptance MCMC with GP + KL surrogate.

**Figure 5.** Posterior distribution of calibrated parameters for the riverine diatom model. **Left**: full MCMC, **right**: delayed acceptance MCMC with GP + KL surrogate.

#### **7. Conclusions and Outlook**

We illustrated the application of function-valued surrogates to delayed acceptance MCMC for parameter calibration in simple as well as hierarchical Bayesian models. Using a surrogate for the functional output rather than a cost function or likelihood is useful for several reasons. Conceptually, it allows introducing additional distribution parameters in Bayesian hierarchical models. Our results demonstrate that it is possible and efficient to perform MCMC with delayed acceptance on such models while keeping the dependencies in these additional parameters exact. In particular, the fractional order of the norm appearing in the cost function was left free, which is useful for robust model calibration.

The method was applied to a toy model and to an application case of a riverine diatom model. In both cases, using delayed acceptance with a surrogate for the functional output produced results comparable to using the full model at only about one third of the actual model evaluations. Compared to direct surrogate modeling of the cost function, we also observed an increase in the quality of the predicted cost. This is likely connected to the higher flexibility of modeling the weights of multiple principal components with Gaussian processes with individual hyperparameters.

The described approach is not immune to the curse of dimensionality. On the one hand, the number of required GP regressors grows linearly with the effective dimension of the *output* function space. Since evaluation is fast and parallelizable, this is a minor issue in practice. On the other hand, increasing the dimension of the *input* space soon prohibits the construction of a reliable surrogate due to the number of training points required to fill the parameter space. In such cases, the preprocessing overhead is expected to outweigh the speedup of delayed acceptance MCMC for either functional or scalar surrogates. More detailed investigations will be required to give quantitative estimates of this trade-off.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Data available in a publicly accessible repository. The data presented in this study are openly available in Zenodo at https://doi.org/10.5281/zenodo.5773865, reference number [25].

**Acknowledgments:** This study is a contribution to the *Reduced Complexity Models* grant number ZT-I-0010 funded by the Helmholtz Association of German Research Centers.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**

