*Proceeding Paper* **Global Variance as a Utility Function in Bayesian Optimization †**

**Roland Preuss \* and Udo von Toussaint**

**\*** Correspondence: preuss@ipp.mpg.de

† Presented at the 40th International Workshop on Bayesian Inference and Maximum Entropy Methods in Science and Engineering, online, 4–9 July 2021.

**Abstract:** A Gaussian-process surrogate model based on already acquired data is employed to approximate an unknown target surface. In order to optimally locate the next function evaluations in parameter space, a whole variety of utility functions is at one's disposal. However, a good choice of a specific utility, or of a certain combination of utilities, provides the fastest way to determine the best surrogate surface or its extremum with the lowest possible amount of additional data. In this paper, we propose to consider the global (integrated) variance as a utility function, i.e., to integrate the variance of the surrogate over a finite volume in parameter space. It turns out that this utility not only complements the tool set for fine-tuning investigations in a region of interest but also expedites the optimization procedure in toto.

**Keywords:** global optimization; Bayesian optimization; utility function; global variance

**PACS:** 02.50.-r; 52.65.-y

#### **1. Introduction**

In many experimental or theoretical approaches the effort of acquiring data may be costly, time consuming, or both. The goal is to gain insight into either the overall or the extremal behaviour of a target quantity with respect to a set of parameters. If insights into the functional dependencies between target and parameters can only be obtained from a computationally expensive function, which may be considered a black-box function, it is instructive to employ surrogate modelling: already acquired data serve as a starting basis for establishing a surrogate surface in parameter space, which then gets explored by Bayesian optimization [1]. An overall survey of Bayesian optimization may be found in [2], though it concentrates on the expected improvement (EI) utility and considers noise in the data only in the last paragraph, again focusing on EI. In contrast to that instructive study, we propose to alternate between the different utilities at hand. Moreover, a fast information-theoretic approach to Bayesian optimization is shown in [3], though it approximates any black-box function by a parabolic form, which differs from our ansatz of leaving the black-box function untouched. Interesting insights into multi-objective Bayesian optimization are provided by [4], which understands "multi-objective" in the sense of seeking the extrema, each freely chosen as maximum or minimum, of a set of single-objective functions. The present paper, however, concentrates on finding a common extremum depending on multiple parameters.

For the surrogate modelling we use the Gaussian process (GP) method [5], whose early stages date back to the middle of the last century, with the very first efforts in the geosciences [6] tackling the problem of surrogate modelling by so-called kriging [7]. Since then, the GP method has been much appreciated in the fields of neural networks and machine learning [8–12], and further work showed the applicability of active data selection via variance-based

**Citation:** Preuss, R.; von Toussaint, U. Global Variance as a Utility Function in Bayesian Optimization. *Phys. Sci. Forum* **2021**, *3*, 3. https:// doi.org/10.3390/psf2021003003

Academic Editor: Sascha Ranftl

Published: 5 November 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

Max–Planck–Institut für Plasmaphysik, 85748 Garching, Germany; udt@ipp.mpg.de

criteria [13,14]. Our implementation of the GP method in this paper was already introduced in [15], and follows in notation, apart from small amendments, the very instructive book of Rasmussen and Williams [5].

While in a previous work [16] we investigated the performance of utility functions based on the expected improvement of an additional data point or on the data point with the maximal variance, in this paper we would like to introduce the global variance, i.e., the integral of the variance of the target surrogate over a region of interest, evaluated with respect to a newly added data point. It is a substantial advantage of the Gaussian process method that such a task may be tackled simply on the basis of already acquired data, i.e., before new data have to be determined.

#### **2. Global Variance for Gaussian Process-Based Model**

In the following we concisely report the formulas leading to the results in this paper. For a thorough discussion of Gaussian processes please refer to the above mentioned papers, especially to the book of Rasmussen & Williams [5].

The problem of predicting function values in a multi-dimensional space supported by given data is a regression problem for a non-trivial function of unknown shape. Given are $n$ target data $\mathbf{y}$ for input vectors $\mathbf{x}\_i$ of dimension $N\_{\rm dim}$, collected in the matrix $\mathbf{X} = (\mathbf{x}\_1, \mathbf{x}\_2, \ldots, \mathbf{x}\_n)$ and written as

$$\mathbf{y} = \begin{pmatrix} y\_1 \\ y\_2 \\ \vdots \\ y\_n \end{pmatrix}, \quad \mathbf{x}\_i = \begin{pmatrix} x\_{i1} \\ x\_{i2} \\ \vdots \\ x\_{iN\_{\text{dim}}} \end{pmatrix}, \quad \mathbf{X} = \begin{pmatrix} x\_{11} & x\_{21} & \dots & x\_{n1} \\ x\_{12} & x\_{22} & \dots & x\_{n2} \\ \vdots & \vdots & \ddots & \vdots \\ x\_{1N\_{\text{dim}}} & x\_{2N\_{\text{dim}}} & \dots & x\_{nN\_{\text{dim}}} \end{pmatrix}. \tag{1}$$

We assume that the target data $y\_i$ are blurred by Gaussian noise with variance $\sigma\_{d\_i}^2$. Further, we assume that the black-box function interconnecting input $\mathbf{X}$ and target $\mathbf{y}$ is at least uniformly continuous, thereby justifying the description of the target surface by a surrogate from the Gaussian process method. Regardless of the experimental data and the physics background, all quantities throughout this paper are given without units.

The decisive quantity of a Gaussian process is the covariance function $k$, which describes the similarity of two vectors $\mathbf{x}\_p$ and $\mathbf{x}\_q$ as a function of their distance, defined by

$$k(\mathbf{x}\_p, \mathbf{x}\_q) = \sigma\_f^2 \exp\left\{-\frac{1}{2} \left|\frac{\mathbf{x}\_p - \mathbf{x}\_q}{\lambda}\right|^2\right\}, \tag{2}$$

with the signal variance $\sigma\_f^2$ and length scale $\lambda$. A covariance matrix $(\mathbf{K})\_{ij} = k(\mathbf{x}\_i, \mathbf{x}\_j)$ collects the covariances of all input data $\mathbf{X}$. The GP method describes a target value $f\_*$ at a test input vector $\mathbf{x}\_*$ by a normal distribution $p(f\_*|\mathbf{X}, \mathbf{y}, \mathbf{x}\_*) \propto \mathcal{N}\big(\bar{f}\_*, \mathrm{var}(\mathbf{x}\_*)\big)$ with mean $\bar{f}\_* = \mathbf{k}\_*^T \big(\mathbf{K} + \sigma\_n^2 \Delta\big)^{-1} \mathbf{y}$ and variance $\mathrm{var}(\mathbf{x}\_*) = k(\mathbf{x}\_*, \mathbf{x}\_*) - \mathbf{k}\_*^T \big(\mathbf{K} + \sigma\_n^2 \Delta\big)^{-1} \mathbf{k}\_*$, where the term $\sigma\_n^2 \Delta$ represents the degree of information in the data. While $\Delta$ is the diagonal matrix of the given variances $\sigma\_{d\_i}^2$, the variance $\sigma\_n^2$ accounts for an overall noise in the data. The full covariance matrix $\mathbf{M}$ of the Gaussian process is then

$$\mathbf{M} = \mathbf{K} + \sigma\_n^2 \Delta. \tag{3}$$
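The predictive equations above translate directly into code. The following minimal sketch (function names and the toy usage are ours, not from the paper) implements the squared-exponential covariance of Equation (2) and the predictive mean and variance:

```python
import numpy as np

def sq_exp_kernel(Xp, Xq, sigma_f, lam):
    """Squared-exponential covariance of Eq. (2).
    Xp: (n, d), Xq: (m, d) -> (n, m) matrix of k(x_p, x_q)."""
    d2 = ((Xp[:, None, :] - Xq[None, :, :]) ** 2).sum(-1)
    return sigma_f**2 * np.exp(-0.5 * d2 / lam**2)

def gp_predict(X, y, X_star, sigma_f, lam, sigma_n, sigma_d):
    """Predictive mean f*_bar and variance var(x*) at the test points X_star.
    sigma_d holds the given per-datum variances (the diagonal of Delta)."""
    M = sq_exp_kernel(X, X, sigma_f, lam) + sigma_n**2 * np.diag(sigma_d)
    k_star = sq_exp_kernel(X, X_star, sigma_f, lam)        # (n, m)
    mean = k_star.T @ np.linalg.solve(M, y)                # k*^T M^-1 y
    v = np.linalg.solve(M, k_star)                         # M^-1 k*
    var = sigma_f**2 - np.einsum('ij,ij->j', k_star, v)    # k(x*, x*) = sigma_f^2
    return mean, var
```

With small noise the surrogate interpolates the data; in practice a Cholesky factorization of `M` is preferable to repeated calls of `np.linalg.solve`.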

In Bayesian probability theory the three parameters *θ* = (*λ*, *σ<sup>f</sup>* , *σn*)*<sup>T</sup>* are considered to be hyper-parameters which show up in the marginal likelihood as

$$\log p(y|\theta) = \text{const} - \frac{1}{2} y^T \left[ \mathbf{K}(\lambda, \sigma\_f) + \sigma\_n^2 \Delta \right]^{-1} y - \frac{1}{2} \log \left| \mathbf{K}(\lambda, \sigma\_f) + \sigma\_n^2 \Delta \right|. \tag{4}$$
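A direct transcription of Equation (4) might look as follows (the additive constant is omitted, and the names are ours):

```python
import numpy as np

def log_marginal_likelihood(X, y, sigma_d, lam, sigma_f, sigma_n):
    """log p(y|theta) up to the constant, Eq. (4), theta = (lam, sigma_f, sigma_n)."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = sigma_f**2 * np.exp(-0.5 * d2 / lam**2)            # K(lam, sigma_f)
    M = K + sigma_n**2 * np.diag(sigma_d)                  # K + sigma_n^2 Delta
    _, logdet = np.linalg.slogdet(M)                       # stable log |M|
    return -0.5 * y @ np.linalg.solve(M, y) - 0.5 * logdet
```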

In [16] we showed that, for a sufficiently large data base, the target surrogate is well described by using the expectation values of the hyper-parameters in the formulas for $\bar{f}\_*$ and $\mathrm{var}(f\_*)$, at least well enough to determine a global optimum in a region of interest (RoI). The global optimum is found by employing utility functions, such as the expected improvement $U\_{\rm EI}(\mathbf{x}\_*) = \int\_{f\_{\rm max}}^{\infty} f\_* \, p(f\_*|\mathbf{X}, \mathbf{y}, \mathbf{x}\_*)\, \mathrm{d}f\_*$ and the maximum variance $U\_{\rm MV}(\mathbf{x}\_*) = \mathrm{var}(f\_*)$. For both, the respective maximum at $\mathbf{x}\_{*\rm max} = \arg\max\_{\mathbf{x}\_*} U\_{\rm EI/MV}$ is sought. While the first utility ($U\_{\rm EI}$) evaluates the possible information gain from a new data point, the second ($U\_{\rm MV}$) simply selects the vector in input space with the largest variance of the target surrogate.
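For a Gaussian predictive density both utilities have simple closed forms; the sketch below (our names) uses the standard Gaussian integral $\int\_{f\_{\max}}^{\infty} f\, \mathcal{N}(f; \mu, s^2)\, \mathrm{d}f = \mu\, \Phi\big(\tfrac{\mu - f\_{\max}}{s}\big) + s\, \phi\big(\tfrac{f\_{\max} - \mu}{s}\big)$:

```python
import numpy as np
from scipy.stats import norm

def utility_ei(mean, var, f_max):
    """U_EI(x*) = int_{f_max}^inf f* p(f*|X, y, x*) df* for p = N(mean, var)."""
    s = np.sqrt(np.maximum(var, 1e-300))   # guard against a vanishing variance
    z = (f_max - mean) / s
    return mean * norm.sf(z) + s * norm.pdf(z)

def utility_mv(var):
    """U_MV(x*) = var(f*): pick the point of largest predictive variance."""
    return var

# Next evaluation point over a candidate grid (illustrative):
# x_next = candidates[np.argmax(utility_ei(means, variances, f_max))]
```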

In order to assess the implications of an additional data point for the surrogate, we propose a further utility function, i.e., the global variance, defined on the multi-dimensional RoI $\in [-\mathbf{1} : \mathbf{1}]$ by

$$\mathcal{U}\_{\rm GV} = \int\_{-\mathbf{1}}^{\mathbf{1}} \mathrm{var}(\mathbf{x})\, \mathrm{d}\mathbf{x}\,. \tag{5}$$

The exact integration, shown in Appendix A, leads to

$$\begin{split} \mathcal{U}\_{\rm GV}^{\rm exact} &= 2^{N\_{\rm dim}} \sigma\_f^2 - \sigma\_f^4 \left(\frac{\sqrt{\pi}\lambda}{2}\right)^{N\_{\rm dim}} \sum\_{ij}^{n} \left(\mathbf{M}^{-1}\right)\_{ij} e^{-\frac{1}{4\lambda^2}\left(\mathbf{x}\_i - \mathbf{x}\_j\right)^T\left(\mathbf{x}\_i - \mathbf{x}\_j\right)} \\ &\quad \cdot \prod\_k^{N\_{\rm dim}} \left\{ \operatorname{erf}\left[\frac{1}{\lambda} - \frac{x\_{ik} + x\_{jk}}{2\lambda}\right] - \operatorname{erf}\left[-\frac{1}{\lambda} - \frac{x\_{ik} + x\_{jk}}{2\lambda}\right] \right\} \end{split} \tag{6}$$
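The closed form of Appendix A (Equation (A15)) can be transcribed almost literally; the following sketch (our names) vectorizes the double sum over data pairs:

```python
import numpy as np
from scipy.special import erf

def u_gv_exact(X, M_inv, sigma_f, lam):
    """Exact integral of var(x) over the RoI [-1, 1]^Ndim, cf. Eq. (A15).
    X: (n, d) input data, M_inv: inverse of the full covariance matrix M."""
    n, d = X.shape
    diff2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # |x_i - x_j|^2
    gauss = np.exp(-diff2 / (4.0 * lam**2))
    mid = (X[:, None, :] + X[None, :, :]) / 2.0              # (x_i + x_j)/2
    erf_prod = np.prod(erf((1.0 - mid) / lam) - erf((-1.0 - mid) / lam), axis=-1)
    s = np.sum(M_inv * gauss * erf_prod)
    return 2**d * sigma_f**2 - sigma_f**4 * (np.sqrt(np.pi) * lam / 2.0)**d * s
```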

Though Equation (6) represents the correct result, in actual computation runs the evaluation of the error function may turn out to be substantially time consuming compared to the total expenditure. Therefore, we would like to introduce two alternatives to the exact integration of Equation (5).

The first one is an approximation. Since outside of the RoI the integrand in Equation (5) shows only marginal contributions, we shift the upper and lower integration bounds to $\pm\infty$ and obtain from simple Gaussian integrals

$$\mathcal{U}\_{\rm GV}^{\rm inf} \approx -\sigma\_f^4 \left(\sqrt{\pi}\lambda\right)^{N\_{\rm dim}} \sum\_{ij}^n \left(\mathbf{M}^{-1}\right)\_{ij} \exp\left\{-\frac{1}{4\lambda^2} \left(\mathbf{x}\_i - \mathbf{x}\_j\right)^T \left(\mathbf{x}\_i - \mathbf{x}\_j\right)\right\}. \tag{7}$$

We dropped the first term, $\big[\sigma\_f^2 x\big]\_{-\infty}^{\infty}$, for being infinite; in any case it would be a constant contribution regardless of changes in the input $\mathbf{X}$. Although the utility $U\_{\rm GV}^{\rm inf}$ in Equation (7) is only an approximation, it has the advantage of being much more accessible by numerical means, and its computation performs much faster compared to Equation (6).
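With the bounds at infinity, Equation (7) reduces to a Gaussian-weighted double sum; a sketch under the same naming assumptions as before:

```python
import numpy as np

def u_gv_inf(X, M_inv, sigma_f, lam):
    """Eq. (7): integration bounds shifted to +-infinity, divergent constant dropped."""
    d = X.shape[1]
    diff2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    s = np.sum(M_inv * np.exp(-diff2 / (4.0 * lam**2)))
    return -sigma_f**4 * (np.sqrt(np.pi) * lam)**d * s
```

For length scales small compared to the RoI the erf factors of Equation (6) saturate, and the two utilities then differ only by the constant $2^{N\_{\rm dim}} \sigma\_f^2$.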

A second, much more sophisticated approach is to insert an enveloping Gaussian function with adjustable location $\mathbf{x}\_G$ (guiding center) and variance $\sigma\_G^2$ into the integral of Equation (5). Again the integration limits are shifted to $\pm\infty$; however, this time the enveloping Gaussian function takes care of the integrability, and we get

$$\begin{split} \mathcal{U}\_{\text{GV}}^{\text{env}} &= \int\_{-\infty}^{\infty} \text{var}(\mathbf{x}) \left( \frac{1}{\sqrt{2\pi\sigma\_{G}^{2}}} \right)^{N\_{\text{dim}}} \exp\left[ -\frac{1}{2\sigma\_{G}^{2}} (\mathbf{x} - \mathbf{x}\_{G})^{T} (\mathbf{x} - \mathbf{x}\_{G}) \right] \mathrm{d}\mathbf{x} \\ &= \sigma\_{f}^{2} - \sigma\_{f}^{4} \left( \frac{\lambda}{\sqrt{2\sigma\_{G}^{2} + \lambda^{2}}} \right)^{N\_{\text{dim}}} \sum\_{ij}^{n} (\mathbf{M}^{-1})\_{ij} \\ & \quad \cdot \exp\left\{ -\frac{1}{2} \left[ \frac{\mathbf{x}\_{i}^{T}\mathbf{x}\_{i} + \mathbf{x}\_{j}^{T}\mathbf{x}\_{j}}{\lambda^{2}} + \frac{\mathbf{x}\_{G}^{T}\mathbf{x}\_{G}}{\sigma\_{G}^{2}} - \frac{\left( \frac{\mathbf{x}\_{i} + \mathbf{x}\_{j}}{\lambda^{2}} + \frac{\mathbf{x}\_{G}}{\sigma\_{G}^{2}} \right)^{T} \left( \frac{\mathbf{x}\_{i} + \mathbf{x}\_{j}}{\lambda^{2}} + \frac{\mathbf{x}\_{G}}{\sigma\_{G}^{2}} \right)}{\frac{2}{\lambda^{2}} + \frac{1}{\sigma\_{G}^{2}}} \right] \right\}. \end{split} \tag{8}$$

The two parameters *x<sup>G</sup>* and *σ<sup>G</sup>* provide a toolset to guide the search for the next target data evaluations: a smaller standard deviation *σ<sup>G</sup>* shifts the attention to the center of the enveloping Gaussian, while *x<sup>G</sup>* gives us the possibility to focus on certain regions in the RoI.
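Equation (8) likewise reduces to matrix operations. In the sketch below (our names), `x_g` and `sigma_g` denote the guiding center $x\_G$ and width $\sigma\_G$:

```python
import numpy as np

def u_gv_env(X, M_inv, sigma_f, lam, x_g, sigma_g):
    """Eq. (8): global variance weighted by the enveloping Gaussian N(x_G, sigma_G^2 I)."""
    d = X.shape[1]
    a = 2.0 / lam**2 + 1.0 / sigma_g**2                    # 2/lam^2 + 1/sigma_G^2
    xi2 = (X**2).sum(-1)                                   # x_i^T x_i
    v = (X[:, None, :] + X[None, :, :]) / lam**2 + x_g / sigma_g**2
    expo = -0.5 * ((xi2[:, None] + xi2[None, :]) / lam**2
                   + float(x_g @ x_g) / sigma_g**2
                   - (v**2).sum(-1) / a)
    pref = (lam / np.sqrt(2.0 * sigma_g**2 + lam**2))**d
    return sigma_f**2 - sigma_f**4 * pref * np.sum(M_inv * np.exp(expo))
```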

All three utilities employing the global variance, $U\_{\rm GV}^{\rm exact}$, $U\_{\rm GV}^{\rm inf}$ and $U\_{\rm GV}^{\rm env}$, require the inversion of the full covariance matrix $\mathbf{M}$ of Equation (3). Since the inversion has to be performed for every newly proposed test input $\mathbf{x}\_*$, it is the main time-consuming part of the whole procedure. Let us remind the reader that the proposed method operates entirely in input space (together with already acquired data) and that the bottleneck is the generation of new target data; the premise of very expensive (i.e., time-consuming) data acquisition therefore still holds. However, we can beneficially use blockwise matrix inversion [17], since a new input vector $\mathbf{x}\_{n+1}$ expands the covariance matrix by only one additional row and column. Consequently, the computational effort is reduced to $n^2$ behaviour instead of the $n^3$ of standard inversion.
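The blockwise update can be sketched as follows; `b` is the column of covariances between the new input and the old ones, `c` the new diagonal entry (names are ours):

```python
import numpy as np

def inv_add_row(M_inv, b, c):
    """Given M_inv = M^-1 of the old (n, n) matrix, return the inverse of the
    (n+1, n+1) matrix [[M, b], [b^T, c]] in O(n^2) via the Schur complement."""
    v = M_inv @ b
    s = 1.0 / (c - b @ v)                  # inverse Schur complement (scalar)
    n = len(b)
    out = np.empty((n + 1, n + 1))
    out[:n, :n] = M_inv + s * np.outer(v, v)
    out[:n, n] = -s * v
    out[n, :n] = -s * v
    out[n, n] = s
    return out
```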

#### **3. Proof of Principle**

We follow the global optimization scheme from Section 4 of [18]. Again we give a proof of principle with a "black box" model featuring a broad parabolic maximum $2 - \sum\_k^{N\_{\rm dim}} \big(x\_k + (-1)^k 0.3\big)^2$ together with a smaller cosine structure $0.1\cos[2\pi(x\_k - 0.3)/\Delta\_{\cos}]$ on top of it, where we focus on a decent ripple of $\Delta\_{\cos} = 0.6$ in one and two dimensions ($N\_{\rm dim} = 1, 2$).
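A possible reconstruction of this test model is sketched below; the exact placement of the offsets is our reading of the formula, chosen such that the true maximum $y = 2.2$ at $(0.3; -0.3)$ quoted in the Discussion is reproduced:

```python
import numpy as np

def black_box(x, delta_cos=0.6):
    """Broad parabolic maximum with a cosine ripple on top; x: array of length Ndim.
    The parabola peaks at x_k = -(-1)^k * 0.3, i.e. at (0.3, -0.3) for Ndim = 2."""
    x = np.asarray(x, dtype=float)
    k = np.arange(1, x.size + 1)
    parabola = 2.0 - np.sum((x + (-1.0)**k * 0.3)**2)
    ripple = 0.1 * np.sum(np.cos(2.0 * np.pi * (x - 0.3) / delta_cos))
    return parabola + ripple
```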

Figures 1–3 show in their left and right panels the results for one and two dimensions, respectively. The $x$ axis counts the number of newly acquired data for the utility comparisons in Figure 3 and in the bottom rows (**c**), (**d**) of Figures 1 and 2.

For every newly added data point proposed by the various utilities, the distance between the true location of the global optimum and the maximal value of the surrogate residing on the data at hand is calculated in Figure 1. In a similar fashion, the search for the best surrogate description of the hidden model is shown in Figure 2.

Eventually, Figure 3 demonstrates the use of an enveloping Gaussian function in the integral of the global variance by varying its center $\mathbf{x}\_G$; e.g., if an educated guess about the location of the extremal structure is at hand, the guiding center $\mathbf{x}\_G$ is preset to the positive axis (1d) or to the quadrant (2d) containing the true model maximum. Consequently, Table 1 displays for 1d the specific number of data and for 2d the saturation level at which the target surrogate enters the stage of resembling the true model, i.e., at which the summed (absolute) differences between all grid points of the target surrogate and the model start to diminish with the number of target data only.


**Table 1.** Comparison of enveloping Gaussian utilities with different integral weights and guiding centers in finding the best surrogate. **1d**: Changing step to solution. **2d**: Saturation level of the solution.

**Figure 1.** (**a**,**b**): One- and two-dimensional model with target data (full circles). The square in the bottom line/surface represents the true maximum. On the left the gray shaded area represents the uncertainty region of the surrogate (full line) from using the expected improvement utility only. On the right the points in the bottom surface are input data. Full circles represent additional data proposed by combination of all three utilities. (**c**,**d**): Distance surrogate/true maximum for different utilities employed in the global optimization procedure.

**Figure 2.** (**a**,**b**): One- and two-dimensional surrogate with newly acquired data. Surrogate solution (full line) on the left from using combination of *U*MV and *U*GV. Surrogate surface on the right from employing *U*GV only. (**c**,**d**): Comparison of the differences between surrogate and true model integrated over RoI for various utilities.

**Figure 3.** Summation over the differences between grid points of the target surrogate and the true model as a function of additionally acquired data. (**a**,**b**): One- and two-dimensional case for the enveloping Gaussian utility with various weights and guiding center at the origin and at (0.5) or (0.5; −0.5).

#### **4. Discussion**

The results above show the usage of various utilities as a toolbox for surrogate modeling. Depending on the task—either to find an extremum or to get a best surrogate description of an unknown "black box" model—and depending on the prior knowledge at hand—presumption of location of the sought extremum or concentration on the region of interest—it is advisable to choose the most eligible utility function. However, even more promising is the combination of utilities of different character to profit from their benefits in toto and to compensate for pitfalls and drawbacks of one or the other utility.

As can be seen in the very first example in Figure 1 for the one-dimensional case, the global optimum is found very quickly with the help of the expected improvement utility $U\_{\rm EI}$ (starting to enter the bump with the correct extremum already below $N = 10$). However, a known drawback of this utility is that it gets stuck in local extrema and that it takes an unreasonably high number of additional data to escape such a pitfall.

This is taken into account for the two-dimensional case, where the best result with the lowest difference to the exact result is obtained by combining all three utilities $U\_{\rm EI}$, $U\_{\rm MV}$ and $U\_{\rm GV}$. Focusing the maximum search on the expected improvement utility alone (black line in Figure 1d) would have got stuck in a local extremum with $y = 2.03$ in the "wrong" quadrant at $(-0.26; -0.29)$, never recovering from this up to about $N = 63$ (internal stop of the algorithm after no improvement at the level of computing accuracy) and totally missing the true optimum with $y = 2.2$ at $(0.3; -0.3)$.

The situation changes for the task of obtaining the best overall description within the region of interest. To accomplish this, the newly introduced global variance utility $U\_{\rm GV}$ is of tremendous help in both one and two dimensions, either alone or in combination with at least the maximum variance utility $U\_{\rm MV}$. As shown in Figure 2d, the best surrogate can already be established with around ninety data points by employing $U\_{\rm GV}$ only (full circles in the target surface of Figure 2b), with very few deviations from the true model left.

A guess about the approximate location of an extremal structure, without excluding other regions, can be emphasized by a further refinement of the global variance utility. Letting an enveloping Gaussian act within the global variance integral of Equation (5) not only makes the result much easier to tackle computationally, but also allows the numerical search for the global optimum to be guided by presetting the center $\mathbf{x}\_G$ of the Gaussian and its integral weight (i.e., width $\sigma\_G$).

Figure 3 shows the results for three different integral weights (0.6; 0.8; 0.9) of the enveloping Gaussian function at two guiding centers. The first one, at the origin, corresponds to an ignorant scenario where one is not sure about the position of the global optimum at all. In the second approach we suppose that the extremal structure is to be found in one dimension at positive values and thereby set $\mathbf{x}\_G = 0.5$, while in two dimensions it may be located within the quadrant with positive values for $x\_1$ and negative ones for $x\_2$, resulting in $\mathbf{x}\_G = [0.5; -0.5]$. As can be seen already in Figure 3, and all the more from the numbers in Table 1, displacing the center of the enveloping Gaussian function towards the real center of the optimum of the hidden model facilitates the development of the surrogate surface that best resembles the true model.

**Author Contributions:** The authors contributed equally to this work. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Appendix A. Global Variance: Derivation of the Exact Integration**

The variance at some (test) point $\mathbf{x}^T = (x\_1, x\_2, \ldots, x\_{N\_{\rm dim}})$ in a region confined to $[-\mathbf{1}, \mathbf{1}]$ of dimension $N\_{\rm dim}$ is

$$\text{var}(\mathbf{x}) = k(\mathbf{x}, \mathbf{x}) - \mathbf{k}^T (\mathbf{K} + \sigma\_n^2 \Delta)^{-1} \mathbf{k} \quad . \tag{A1}$$

The covariance *k*(*xp*, *xq*) between pairs of input variables (*xp*, *xq*) is defined by

$$k(\mathbf{x}\_p, \mathbf{x}\_q) = \sigma\_f^2 \exp\left[ -\frac{(\mathbf{x}\_p - \mathbf{x}\_q)^T (\mathbf{x}\_p - \mathbf{x}\_q)}{2\lambda^2} \right] \quad . \tag{A2}$$

While the first term in Equation (A1) is simply $k(\mathbf{x}, \mathbf{x}) = \sigma\_f^2$, for the second term we need

$$\mathbf{k} = \sigma\_f^2 \begin{pmatrix} \exp\left[ -\frac{1}{2\lambda^2} (\mathbf{x} - \mathbf{x}\_1)^T (\mathbf{x} - \mathbf{x}\_1) \right] \\ \exp\left[ -\frac{1}{2\lambda^2} (\mathbf{x} - \mathbf{x}\_2)^T (\mathbf{x} - \mathbf{x}\_2) \right] \\ \vdots \\ \exp\left[ -\frac{1}{2\lambda^2} (\mathbf{x} - \mathbf{x}\_N)^T (\mathbf{x} - \mathbf{x}\_N) \right] \end{pmatrix} \tag{A3}$$

and the inversion of the matrix $\mathbf{M} = \mathbf{K} + \sigma\_n^2 \Delta$, whose elements are given by $\Delta\_{ii} = \sigma\_{d\_i}^2$ ($\Delta\_{ij} = 0$ for $i \neq j$) and $K\_{ij} = \sigma\_f^2 \exp\big[ -\frac{1}{2\lambda^2} (\mathbf{x}\_i - \mathbf{x}\_j)^T (\mathbf{x}\_i - \mathbf{x}\_j) \big]$. For a given set of hyper-parameters the matrix $\mathbf{M}$ does not depend on the test vector $\mathbf{x}$ and may be treated as a constant in the integration over $\mathrm{d}\mathbf{x}$. Therefore, once the inversion has been performed, each element of $\mathbf{M}^{-1}$ can be regarded as a pure number. The second term in Equation (A1) is then just a sum over all components with indices $\{i, j\} \in (1, \ldots, N)$,

$$\mathbf{k}\_\*^T (\mathbf{K} + \sigma\_n^2 \Delta)^{-1} \mathbf{k}\_\* = \sum\_{i,j=1}^N k\_{\*i} \left( \mathbf{M}^{-1} \right)\_{ij} k\_{\*j} \tag{A4}$$

$$= \sigma\_f^4 \sum\_{i,j=1}^N \left(\mathbf{M}^{-1}\right)\_{ij} e^{-\frac{(\mathbf{x}-\mathbf{x}\_i)^T(\mathbf{x}-\mathbf{x}\_i)}{2\lambda^2}}\, e^{-\frac{(\mathbf{x}-\mathbf{x}\_j)^T(\mathbf{x}-\mathbf{x}\_j)}{2\lambda^2}}. \tag{A5}$$

Further, let us concentrate on the terms in the numerators of the exponents:

$$(\mathbf{x} - \mathbf{x}\_{i})^{T}(\mathbf{x} - \mathbf{x}\_{i}) + (\mathbf{x} - \mathbf{x}\_{j})^{T}(\mathbf{x} - \mathbf{x}\_{j}) = 2\left[\mathbf{x}^T\mathbf{x} - \mathbf{x}^{T}\frac{\mathbf{x}\_{i} + \mathbf{x}\_{j}}{2} - \frac{\mathbf{x}\_{i}^{T} + \mathbf{x}\_{j}^{T}}{2}\mathbf{x}\right] + \mathbf{x}\_{i}^T\mathbf{x}\_{i} + \mathbf{x}\_{j}^T\mathbf{x}\_{j} \tag{A6}$$

Completing the square gives

$$2\left[\left(\mathbf{x} - \frac{\mathbf{x}\_{i} + \mathbf{x}\_{j}}{2}\right)^{T}\left(\mathbf{x} - \frac{\mathbf{x}\_{i} + \mathbf{x}\_{j}}{2}\right)\right] + \frac{1}{2}\left(\mathbf{x}\_{i} - \mathbf{x}\_{j}\right)^{T}\left(\mathbf{x}\_{i} - \mathbf{x}\_{j}\right) \,. \tag{A7}$$

We insert Equation (A7) in Equation (A5) and finally get for the variance

$$\text{var}(\mathbf{x}) = \sigma\_f^2 - \sigma\_f^4 \sum\_{i,j=1}^N \left(\mathbf{M}^{-1}\right)\_{ij} e^{-\frac{1}{4\lambda^2} \left(\mathbf{x}\_i - \mathbf{x}\_j\right)^T \left(\mathbf{x}\_i - \mathbf{x}\_j\right)}\, e^{-\frac{1}{\lambda^2} \left[\left(\mathbf{x} - \frac{\mathbf{x}\_i + \mathbf{x}\_j}{2}\right)^T \left(\mathbf{x} - \frac{\mathbf{x}\_i + \mathbf{x}\_j}{2}\right)\right]}. \tag{A8}$$

Only the second exponential in Equation (A8) depends on *x* and therefore needs to be considered in the integral of the global variance:

$$\int\_{-1}^{1} \mathbf{dx} \,\mathrm{var}(\mathbf{x}) \,. \tag{A9}$$

We insert Equation (A8) into Equation (A9) and retain the integral only for the term with $\mathbf{x}$ dependence:

$$\begin{split} \int\_{-\mathbf{1}}^{\mathbf{1}} \mathbf{dx} \,\mathrm{var}(\mathbf{x}) &= 2^{N\_{\rm dim}} \sigma\_f^2 - \sigma\_f^4 \sum\_{i,j=1}^N \left( \mathbf{M}^{-1} \right)\_{ij} e^{-\frac{1}{4\lambda^2} \left( \mathbf{x}\_i - \mathbf{x}\_j \right)^T \left( \mathbf{x}\_i - \mathbf{x}\_j \right)} \\ &\quad \cdot \int\_{-\mathbf{1}}^{\mathbf{1}} \mathbf{dx} \, e^{-\frac{1}{\lambda^2} \left[ \left( \mathbf{x} - \frac{\mathbf{x}\_i + \mathbf{x}\_j}{2} \right)^T \left( \mathbf{x} - \frac{\mathbf{x}\_i + \mathbf{x}\_j}{2} \right) \right]} . \end{split} \tag{A10}$$

Since the term in the exponent is quadratic, it separates into a sum, which in turn lets the integral factorize over the dimensions. Reduced to $N\_{\rm dim}$ one-dimensional integrals, they can easily be solved by employing the error function. To see this, let us have a closer look at the integral alone:

$$\int\_{-\mathbf{1}}^{\mathbf{1}} \mathbf{dx} \, e^{-\frac{1}{\lambda^2} \left[ \left( \mathbf{x} - \frac{\mathbf{x}\_i + \mathbf{x}\_j}{2} \right)^T \left( \mathbf{x} - \frac{\mathbf{x}\_i + \mathbf{x}\_j}{2} \right) \right]} = \prod\_{k=1}^{N\_{\rm dim}} \int\_{-1}^{1} \mathrm{d}x\_k \, e^{-\frac{1}{\lambda^2} \left( x\_k - \frac{x\_{ik} + x\_{jk}}{2} \right)^2} . \tag{A11}$$

Focusing on the $k$th integral and substituting $\tau\_k = \big(x\_k - \frac{x\_{ik} + x\_{jk}}{2}\big)/\lambda$, error functions emerge and we finally end up with:

$$\int\_{-1}^{1} \mathrm{d}x\_k\, e^{-\frac{1}{\lambda^2} \left( x\_k - \frac{x\_{ik} + x\_{jk}}{2} \right)^2} = \lambda \int\_{-\frac{1}{\lambda} - \frac{x\_{ik} + x\_{jk}}{2\lambda}}^{\frac{1}{\lambda} - \frac{x\_{ik} + x\_{jk}}{2\lambda}} \mathrm{d}\tau\_k\, e^{-\tau\_k^2} \tag{A12}$$

$$= \lambda \left[ \int\_{-\frac{1}{\lambda} - \frac{x\_{ik} + x\_{jk}}{2\lambda}}^{0} \mathrm{d}\tau\_k\, e^{-\tau\_k^2} + \int\_0^{\frac{1}{\lambda} - \frac{x\_{ik} + x\_{jk}}{2\lambda}} \mathrm{d}\tau\_k\, e^{-\tau\_k^2} \right] \tag{A13}$$

$$= \frac{\sqrt{\pi}}{2} \lambda \left\{ \operatorname{erf} \left[ \frac{1}{\lambda} - \frac{x\_{ik} + x\_{jk}}{2\lambda} \right] - \operatorname{erf} \left[ -\frac{1}{\lambda} - \frac{x\_{ik} + x\_{jk}}{2\lambda} \right] \right\}. \tag{A14}$$

This concludes the derivation: inserting Equation (A14) into Equation (A10) yields the result reported in the main text:

$$\begin{split} \int\_{-\mathbf{1}}^{\mathbf{1}} \mathbf{dx} \operatorname{var}(\mathbf{x}) &= 2^{N\_{\rm dim}} \sigma\_f^2 - \sigma\_f^4 \left(\frac{\sqrt{\pi}}{2} \lambda\right)^{N\_{\rm dim}} \sum\_{i,j=1}^N \left(\mathbf{M}^{-1}\right)\_{ij} e^{-\frac{1}{4\lambda^2} \left(\mathbf{x}\_i - \mathbf{x}\_j\right)^T \left(\mathbf{x}\_i - \mathbf{x}\_j\right)} \\ &\quad \cdot \prod\_k^{N\_{\rm dim}} \left\{ \operatorname{erf}\left[\frac{1}{\lambda} - \frac{x\_{ik} + x\_{jk}}{2\lambda} \right] - \operatorname{erf}\left[-\frac{1}{\lambda} - \frac{x\_{ik} + x\_{jk}}{2\lambda} \right] \right\}. \end{split} \tag{A15}$$

#### **References**


### *Proceeding Paper* **Bayesian Surrogate Analysis and Uncertainty Propagation †**

**Sascha Ranftl \* and Wolfgang von der Linden**

Institute of Theoretical Physics-Computational Physics, Graz University of Technology, Petersgasse 16, 8010 Graz, Austria; vonderlinden@tugraz.at

**\*** Correspondence: ranftl@tugraz.at

† Presented at the 40th International Workshop on Bayesian Inference and Maximum Entropy Methods in Science and Engineering, online, 4–9 July 2021.

**Abstract:** The quantification of uncertainties of computer simulations due to input parameter uncertainties is paramount to assess a model's credibility. For computationally expensive simulations, this is often feasible only via surrogate models that are learned from a small set of simulation samples. The surrogate models are commonly chosen and deemed trustworthy based on heuristic measures, and substituted for the simulation in order to approximately propagate the simulation input uncertainties to the simulation output. In the process, the contribution of the uncertainties of the surrogate itself to the simulation output uncertainties is usually neglected. In this work, we specifically address the case of doubtful surrogate trustworthiness, i.e., non-negligible surrogate uncertainties. We find that Bayesian probability theory yields a natural measure of surrogate trustworthiness, and that surrogate uncertainties can easily be included in simulation output uncertainties. For a Gaussian likelihood for the simulation data, with unknown surrogate variance and given a generalized linear surrogate model, the resulting formulas reduce to simple matrix multiplications. The framework contains Polynomial Chaos Expansions as a special case, and is easily extended to Gaussian Process Regression. Additionally, we show a simple way to implicitly include spatio-temporal correlations. Lastly, we demonstrate a numerical example where surrogate uncertainties are in part negligible and in part non-negligible.

**Keywords:** uncertainty quantification; uncertainty propagation; surrogate models; meta-modeling; Bayesian

#### **1. Introduction**

Uncertainty quantification of simulations has gained increasing attention, e.g., in the field of Computational Engineering, in order to address doubtful parameter choices and assess the models' credibility. Surrogate models have become a popular tool to propagate simulation input uncertainties to the simulation output, particularly for modern-day applications with high computational cost and many uncertain model parameters. For that, a parametrized surrogate model (synonyms: meta-model, emulator) is learned from a finite set of simulation samples, i.e., the surrogate is a function of the uncertain simulation input parameters that is 'fitted' to the simulation output data. The quality of this fit is then judged by heuristic diagnostics, and the surrogate is deemed trustworthy accordingly. A key aspect of this procedure is that the surrogate can be evaluated much faster than the simulation, and still retains a reasonable approximation to the simulation.

**Citation:** Ranftl, S.; von der Linden, W. Bayesian Surrogate Analysis and Uncertainty Propagation. *Phys. Sci. Forum* **2021**, *3*, 6. https://doi.org/10.3390/psf2021003006

Academic Editor: MaxEnt 2021 Scientific Committee

Published: 13 November 2021

The simulation is then substituted with the surrogate model in order to compute the marginal probability density function of the simulation output. The simulation uncertainties are thus inferred from the surrogate model instead of the original simulation model, at significantly reduced computational effort. While this practice makes estimates of the uncertainties of expensive simulations obtainable in the first place, the contribution of the uncertainty of the surrogate itself to the total simulation uncertainty is commonly neglected. In other words, the estimation of the surrogate parameters based on the finite set of simulation samples entails an additional uncertainty in the sought-for uncertainty of the simulation output. The purpose of this paper is to investigate this surrogate uncertainty as the natural measure of the surrogate's trustworthiness, and to show how the surrogate uncertainty affects the simulation output uncertainty.

In many cases, the surrogate uncertainty is indeed small when the heuristic diagnostics imply so. If the heuristic diagnostics imply that the surrogate is not trustworthy, one may resort to two options: (i) acquire more simulation data until the surrogate is trustworthy, which is limited by the computational budget and the surrogate's convergence properties; or (ii) shrink the parameter space, e.g., omit a number of uncertain simulation parameters by assuming definite parameter values. However, in some cases, (i) is not feasible and (ii) is not desired. In this contribution, we demonstrate how to include surrogate uncertainties when the user deals with a surrogate model of doubtful trustworthiness.

Popular surrogate models are Polynomial Chaos Expansions [1–3] and Gaussian Process Regression [4,5], the latter of which has recently had a renaissance within the machine learning community. In this work, we assume a Gaussian likelihood for the simulation data with unknown variance and a generalized linear surrogate model (i.e., linear in the surrogate parameters) that includes Polynomial Chaos Expansions as a special case and is easily extended to Gaussian Process Regression. Other Bayesian perspectives on Uncertainty Quantification of computer simulations with these popular surrogate models are given in [6–13]. A comprehensive collection of reviews on Uncertainty Quantification, from the point of view of computational engineering and applied mathematics, can be found in [14]. In [7,15], a statistician's perspective is discussed. Here, we will use Bayesian Probability Theory [16].

#### **2. Bayesian Uncertainty Quantification**

We start with the general structure of uncertainty propagation problems based on surrogate models in Section 2.1. In Section 2.2, we analyze a generalized linear surrogate model with a Gaussian likelihood for the simulation data with unknown surrogate variance. In Section 2.3, we proceed to use the surrogate model to propagate the input uncertainties to the output, and show how the surrogate's uncertainties too can be included.

#### *2.1. General Structure of the Problem*

The goal in this paper is to quantify the uncertainties of the simulation results for the observable $z^{(x)}$ at the measurement points $x = 1, \dots, N\_x$ in the simulation domain. For example, $z$ could be the mechanical stress resulting from a structural analysis with a finite element simulation, and $x$ could denote the location of the measurement probe in or on the analyzed structure. $z^{(x)}$ depends on unknown or uncertain model parameters $a = \{a\_i\}\_{i=1}^{N\_a}$, which are generally inferred from experimental data $d\_{\text{exp}}$. Based on these data, Bayes' theorem allows for determining the posterior probability density function (pdf) for $a$,

$$p\left(a \mid d\_{\text{exp}}, \mathcal{I}\right) \,,\tag{1}$$

where all background information on the experiment is subsumed in $\mathcal{I}$. The implications of the background information $\mathcal{I}$ will be discussed later. This pdf will be assumed to be (almost) arbitrary but given in the following considerations. It is usually the result of a statistical data analysis of the foregoing experiment. This experiment could be the measurement of some material property needed for the simulation, e.g., the viscosity for a computational fluid dynamics simulation. The uncertainty of the model parameters $a$ entails an uncertainty in the simulated observable $z^{(x)}$, and the latter is determined by the marginalization rule,

$$p\left(z^{(x)} \mid d\_{\text{exp}}, \mathcal{I}\right) = \int p\left(z^{(x)} \mid a, \cancel{d\_{\text{exp}}}, \mathcal{I}\right) p\left(a \mid d\_{\text{exp}}, \mathcal{I}\right) dV\_a \,. \tag{2}$$

In the first pdf, we have struck out $d\_{\text{exp}}$ because the knowledge of $a$ suffices to perform the simulation and obtain $z^{(x)}$. If $a$ consists of only one or two parameters ($N\_a = 1, 2$), the numerical evaluation of the integral over the model parameters $a$ will typically require a few dozen to a few hundred simulations. The uncertainty propagation is then done and no surrogate is needed. However, this is the trivial case, and usually $a$ will consist of many more parameters. Let us assume that $a$ consists of, e.g., four parameters. That would imply the need to perform the simulation at least $10^5$ times, which is far too CPU-expensive for most real problems. This can be avoided if the simulations are replaced by a surrogate model that approximates the observable $z$ by a suitably parametrized surrogate function $z\_{\text{sur}} = g(a \mid c)$, where $c$ are yet unknown parameters. The simulation may yield the observable $z^{(x)}$ at different sites $x$ in the domain; however, $x$ could also denote the time instance in non-static problems. Clearly, the parameters of the surrogate model will also depend on those positions. Thus, we actually have

$$z^{(x)} \approx z^{(x)}\_{\text{sur}} = g(a \mid c^{(x)})\,. \tag{3}$$

The unknown parameters will be inferred from suitable training data. To this end, simulations are performed for a finite set of model parameters $A\_s = \{a\_s^{(i)}\}\_{i=1}^{N\_s}$, the corresponding observables $Z\_s = \{z\_s^{(x),(i)}\}\_{i,x=1}^{N\_s,N\_x}$ are computed, and both are combined in $D\_{\text{sim}} = \{A\_s, Z\_s\}$. The surrogate parameters $c^{(x)}$ are then inferred from $D\_{\text{sim}}$, and the surrogate is thus constructed. We now proceed to substitute the surrogate for the simulation, $z^{(x)} \to z^{(x)}\_{\text{sur}}$, in order to solve Equation (2) at a significantly reduced computational cost. This implies that the background information has changed. We will denote this as $\tilde{\mathcal{I}}$, indicating that we take the observable $z$ entering the integral in Equation (2) from the surrogate model, Equation (3), rather than from the expensive simulation. More precisely, instead of Equation (2), we now need to consider

$$p\left(z^{(x)} \mid d\_{\text{exp}}, D\_{\text{sim}}, \tilde{\mathcal{I}}\right) = \int dV\_a \; p\left(z^{(x)} \mid a, D\_{\text{sim}}, d\_{\text{exp}}, \tilde{\mathcal{I}}\right) p\left(a \mid D\_{\text{sim}}, d\_{\text{exp}}, \tilde{\mathcal{I}}\right) \,. \tag{4}$$

As far as the (second) pdf for the model parameters is concerned, we can omit the information on the training set $D\_{\text{sim}}$, as it does not tell us anything about the model parameters. This pdf is actually the same as that in Equation (1), i.e., $p(a \mid d\_{\text{exp}}, \tilde{\mathcal{I}}) = p(a \mid d\_{\text{exp}}, \mathcal{I})$, as it makes no difference for the model parameters how we solve the equations underlying the simulation. In the first pdf, we can omit the information on the experiment $d\_{\text{exp}}$, as we only need the simulation data $D\_{\text{sim}}$ to fix the surrogate model, which in turn defines the observable $z$. The first pdf can be further specified by the marginalization rule upon introducing the surrogate parameters $C = \{c^{(x)}\}\_{x=1}^{N\_x}$, where $x$ is an index denoting the measurement points as introduced for $z^{(x)}$,

$$p\left(z^{(x)} \mid a, D\_{\text{sim}}, \tilde{\mathcal{I}}\right) = \int dV\_C \; p\left(z^{(x)} \mid C, a, D\_{\text{sim}}, \tilde{\mathcal{I}}\right) p\left(C \mid a, D\_{\text{sim}}, \tilde{\mathcal{I}}\right) \,. $$

The first pdf is uniquely fixed by the knowledge of $a$ and $C$; hence, $D\_{\text{sim}}$ is superfluous. Similarly, in the second pdf, where $C$ is inferred from the training data, additional model parameters without the corresponding observables' values $z$ are useless. In summary, substituting the latter equation into Equation (4), we have

$$p\left(z^{(x)} \mid d\_{\text{exp}}, D\_{\text{sim}}, \tilde{\mathcal{I}}\right) = \iint dV\_a\, dV\_C \; p\left(z^{(x)} \mid C, a, \tilde{\mathcal{I}}\right) p\left(C \mid D\_{\text{sim}}, \tilde{\mathcal{I}}\right) p\left(a \mid d\_{\text{exp}}, \mathcal{I}\right) \,. \tag{5}$$

The first pdf is rather simple. According to the background information $\tilde{\mathcal{I}}$, we determine the observable via the surrogate model. Since the necessary parameters $c^{(x)} \in C$ and $a$ are part of the conditional complex, the surrogate model allows only one value

$$z^{(x)}\_{\text{sur}} = g(a \mid c^{(x)})$$

for the observable. This means that $p(z^{(x)} \mid C, a, \tilde{\mathcal{I}})$ is equivalent to the probability density function for $z^{(x)}$ given $z^{(x)} = g(a \mid c^{(x)})$. Hence, the pdf is a Dirac delta distribution

$$p\left(z^{(x)} \mid C, a, \tilde{\mathcal{I}}\right) = \delta\left[z^{(x)} - g(a \mid c^{(x)})\right]. \tag{6}$$

Finally, we have

$$p\left(z^{(x)} \mid d\_{\text{exp}}, D\_{\text{sim}}, \tilde{\mathcal{I}}\right) = \iint dV\_a\, dV\_C \; \delta\left[z^{(x)} - g\left(a \mid c^{(x)}\right)\right] p\left(C \mid D\_{\text{sim}}, \tilde{\mathcal{I}}\right) p\left(a \mid d\_{\text{exp}}, \mathcal{I}\right) \,. \tag{7}$$

Before we can evaluate this integral, we first need to determine the two terms, which have their own independent significance. The last term is the result of a data analysis of a specific foregoing experiment, and will therefore not be treated here. We will suppress the background information in the following, as ambiguities should no longer occur.

#### *2.2. Bayesian Analysis and Selection of the Surrogate Model*

We recall that Equation (7) allows for determining the pdf for the observable based on the pdf for the model parameters and the pdf for the parameters of the surrogate model, $p(C \mid D\_{\text{sim}})$, which we will determine now. To this end, we have to specify the form of the surrogate model. We use the expansion

$$z\_{\text{sur}} = \sum\_{\nu=1}^{N\_p} c\_{\nu} \Phi\_{\nu}(a) \; , \tag{8}$$

in terms of basis functions $\Phi\_\nu(a)$ and expansion coefficients $c\_\nu$. No further specification is needed at this point. Without loss of generality, we will use multivariate Legendre polynomials for the numerical examples. This expansion is similar to the frequently used generalized Polynomial Chaos Expansion [1], where the polynomials $\Phi\_\nu(a)$ are orthogonal with respect to the $L^2$ inner product with the prior of $a$, $p(a)$, as an integration measure. However, here we actually consider a posterior $p(a \mid d\_{\text{exp}})$ that generally has no standard form, for which no standard orthogonal polynomial chaos basis is known, and for which conditional independence of the model parameters $a$ does not hold. Polynomial Chaos Expansions have been extended to arbitrary probability measures [17] and dependent parameters [18]; in the present context, however, these polynomials are not of primary interest and would only complicate the numerical evaluation. Note that the approach presented here does not demand any orthogonality properties of the basis, and thus avoids the practical problems encountered with the construction of such orthogonal bases. Parameters here may have complex dependence structures, and the only requirement for the probability distribution $p(a \mid d\_{\text{exp}})$ is that the integrals with respect to it exist. As outlined in Section 2.1, $N\_s$ simulations are performed for a set of model parameters $A\_s = \{a\_s^{(i)}\}\_{i=1}^{N\_s}$ and the corresponding observables $Z\_s$ are computed. The theory is so far agnostic to the experimental design of these simulations, which is therefore not of concern here.
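A basis of the kind used in the numerical example (multivariate Legendre polynomials of total degree at most two) is easy to assemble. The following is a minimal numpy sketch of such a design matrix; the function name `design_matrix` and the scaling of the parameters to $[-1, 1]$ are our assumptions, not prescriptions from the text:

```python
import itertools
import numpy as np
from numpy.polynomial import legendre

def design_matrix(A_s, max_degree=2):
    """Design matrix with entries Phi_nu(a_s^(i)) for a multivariate
    Legendre basis of total degree <= max_degree.
    A_s: (N_s, N_a) array of parameter samples, assumed scaled to [-1, 1]."""
    N_s, N_a = A_s.shape
    # all multi-indices (d_1, ..., d_Na) with d_1 + ... + d_Na <= max_degree
    multi_indices = [m for m in itertools.product(range(max_degree + 1), repeat=N_a)
                     if sum(m) <= max_degree]
    cols = []
    for m in multi_indices:
        col = np.ones(N_s)
        for j, deg in enumerate(m):
            # Legendre polynomial P_deg evaluated on the j-th parameter
            coeffs = np.zeros(deg + 1)
            coeffs[deg] = 1.0
            col *= legendre.legval(A_s[:, j], coeffs)
        cols.append(col)
    return np.column_stack(cols), multi_indices
```

For $N\_a = 4$ parameters and total degree two, this yields $N\_p = \binom{4+2}{2} = 15$ basis functions per measurement position.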
Now, we want to determine the pdf for the surrogate parameters $C$, which are combined in a matrix with elements $C\_{\nu,x}$, where $\nu$ enumerates the surrogate basis functions and $x$ enumerates the measurement positions in the domain for which the observables are computed. We abbreviated the simulation data by the quantity $D\_{\text{sim}} = \{A\_s, Z\_s\}$, where the matrix $Z\_s$ has the elements $(Z\_s)\_{i,x}$, which represent the observable $z^{(x)}$ at position $x$ corresponding to the model parameter vector $a\_s^{(i)}$. The sought-for pdf follows from Bayes' theorem

$$p\left(C \mid D\_{\text{sim}}\right) = p\left(C \mid Z\_s, A\_s\right) \propto p\left(Z\_s \mid C, A\_s\right) p\left(C \mid A\_s\right) \propto p\left(Z\_s \mid C, A\_s\right).$$

The proportionality constant is not required in the ensuing considerations, and we have assumed an ignorant, uniform prior for the coefficients $C$, i.e., $p(C \mid \mathcal{I}) = const$. We note that this is also the transformation-invariant Riemann prior (see Appendix B). However, any prior that is conjugate to the likelihood will retain analytical tractability. For the likelihood, we need the total misfit, which is given by

$$\begin{split} \chi^{2} &= \sum\_{i=1}^{N\_{\rm s}} \sum\_{x=1}^{N\_{\rm x}} \left( (\mathbf{Z}\_{\rm s})\_{i,x} - \sum\_{\nu=1}^{N\_{\rm p}} (M\_{\rm s})\_{i,\nu} (\mathbf{C})\_{\nu,x} \right)^{2} = \sum\_{i,\rm x} \left( \mathbf{Z}\_{\rm s} - M\_{\rm s} \mathbf{C} \right)\_{i,\rm x}^{2} \\ &= \text{tr}\left\{ \left( \mathbf{Z}\_{\rm s} - M\_{\rm s} \mathbf{C} \right)^{T} \left( \mathbf{Z}\_{\rm s} - M\_{\rm s} \mathbf{C} \right) \right\}, \end{split} \tag{9}$$

with $(M\_s)\_{i,\nu} = \Phi\_\nu(a\_s^{(i)})$ and $N\_{sx} = N\_s \cdot N\_x$. We assume a Gaussian type of likelihood, i.e.,

$$p\left(Z\_s \mid C, A\_s, \Delta\right) = \frac{\Delta^{-N\_{sx}}}{Z'} \exp\left\{-\frac{\chi^2}{2\Delta^2}\right\} \,, \tag{10}$$

with normalization $Z'$. We have mentioned the $\Delta$-dependence of the normalization explicitly, while the rest of the normalization is irrelevant in the present context. Usually, the misfit entering the likelihood stems from the noise of the data. In the present case, however, there is no noise (merely a tiny numerical error), but the surrogate model is presumably not an exact description of the simulation data, and $\Delta$ covers the corresponding uncertainty. However, the uncertainty level $\Delta$ is not known and has to be marginalized over. Along with the appropriate Jeffreys' prior, $p(C, \Delta) = p(C)\, p(\Delta)$, $p(\Delta) = \frac{1}{\Delta}$, $p(C) = const$ (see Appendix B), the integration over $\Delta$ yields

$$p\left(C \mid Z\_s, A\_s\right) = \frac{1}{Z} \left(\chi^2\right)^{-\frac{N\_{sx}}{2}} \,, \tag{11}$$

with terms independent of *C* subsumed in the normalization *Z*. For computing the mean, variance, and evidence, we first complete the square in Equation (9) to get a quadratic form in *C*, which can then be integrated analytically (see Appendix A). The result is

$$\langle C \rangle\_a = H\_s^{-1} M\_s^T Z\_s \,, \qquad H\_s = M\_s^T M\_s \,, \tag{12a}$$

$$\langle \Delta C\_{\nu x} \Delta C\_{\nu' x'} \rangle\_a = \frac{\chi\_{\text{min}}^2}{(N\_s - N\_p)N\_x - 2} \left( H\_s^{-1} \right)\_{\nu\nu'} \delta\_{xx'} \,, \qquad \chi\_{\text{min}}^2 = \text{tr}\left\{ Z\_s^T (\mathbf{1} - M\_s H\_s^{-1} M\_s^T) Z\_s \right\} \,. \tag{12b}$$
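As promised in the abstract, the result reduces to simple matrix multiplications. A minimal numpy sketch of Equations (12a) and (12b); the function and variable names are ours:

```python
import numpy as np

def coefficient_posterior(M_s, Z_s):
    """Posterior mean and covariance of the surrogate coefficients C,
    following Equations (12a)-(12b): <C> = H_s^{-1} M_s^T Z_s with
    H_s = M_s^T M_s, and covariance prefactor chi2_min / ((N_s - N_p) N_x - 2)."""
    N_s, N_p = M_s.shape
    N_x = Z_s.shape[1]
    H_s = M_s.T @ M_s
    C_mean = np.linalg.solve(H_s, M_s.T @ Z_s)      # Equation (12a), shape (N_p, N_x)
    resid = Z_s - M_s @ C_mean
    chi2_min = np.trace(resid.T @ resid)            # misfit at the posterior mean
    prefactor = chi2_min / ((N_s - N_p) * N_x - 2)  # Bayesian estimate of Delta^2
    C_cov_nu = prefactor * np.linalg.inv(H_s)       # Equation (12b), diagonal in x
    return C_mean, C_cov_nu, chi2_min
```

For training data generated exactly by the basis, $\chi^2\_{\text{min}}$ vanishes up to rounding, and so does the coefficient covariance.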

We argue that the prefactor of $H\_s^{-1}$ is the Bayesian estimate for $\Delta^2$, the variance of the Gaussian in Equation (10). This reasoning is similar to [19]. Note that $Z\_s$ is a matrix of size $N\_s \times N\_x$, containing a data vector of length $N\_s$ for each measurement site $x$. As shown in Appendix A, Equation (A6), the evidence for a particular set of surrogate models is computed as

$$p\left(\left\{z\_{\text{sur}}^{(x)}\right\}\_{x=1}^{N\_x} \mid D\_{\text{sim}}, \mathcal{I}\right) = Z = \Omega\_{N\_{px}} \left|H\_s\right|^{-\frac{N\_x}{2}} \left(\chi\_{\text{min}}^2\right)^{-\frac{N\_{sx} - N\_{px}}{2}} \frac{\Gamma\left(\frac{N\_{px}}{2}\right)\Gamma\left(\frac{N\_{sx} - N\_{px}}{2}\right)}{\Gamma\left(\frac{N\_{sx}}{2}\right)} \,,\tag{13}$$

where $\Omega\_{N\_{px}}$ is the solid angle in $N\_{px}$ dimensions. The evidence is the probability for a surrogate model given the data. Note that this quantity does not depend on $p(a \mid d\_{\text{exp}}, \mathcal{I})$. This is reasonable because the analysis of the experimental data should be independent of the analysis of the simulation data. However, $p(a \mid d\_{\text{exp}}, \mathcal{I})$ will typically be used for the experimental design of the simulation data acquisition. By comparing the evidence for different models, the user can choose a particular surrogate model or, if the results do not overwhelmingly suggest one single model, average the results of the surrogate analysis and the subsequent uncertainty propagation over several plausible models. Note that the evidence is the pillar of a Bayesian procedure to select a surrogate model, and is distinct from the procedure of incorporating the trustworthiness or uncertainty of the surrogate in the subsequent uncertainty propagation.
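Model comparison by evidence can be carried out in log space for numerical stability. The sketch below keeps the solid-angle, $|H\_s|$, $\chi^2\_{\text{min}}$ and Gamma-function terms of Equation (13); it relies on our reading of that formula, and any constant common to all candidate bases is irrelevant for the comparison:

```python
import math
import numpy as np

def log_evidence(M_s, Z_s):
    """Schematic log-evidence of a surrogate basis, term by term after
    Equation (13); M_s is the design matrix, Z_s the (N_s, N_x) data."""
    N_s, N_p = M_s.shape
    N_x = Z_s.shape[1]
    N_sx, N_px = N_s * N_x, N_p * N_x
    H_s = M_s.T @ M_s
    C_mean = np.linalg.solve(H_s, M_s.T @ Z_s)
    resid = Z_s - M_s @ C_mean
    chi2_min = np.trace(resid.T @ resid)
    # log of the solid angle Omega_n = 2 pi^(n/2) / Gamma(n/2)
    log_omega = math.log(2.0) + 0.5 * N_px * math.log(math.pi) - math.lgamma(0.5 * N_px)
    _, logdet_H = np.linalg.slogdet(H_s)
    return (log_omega - 0.5 * N_x * logdet_H
            - 0.5 * (N_sx - N_px) * math.log(chi2_min)
            + math.lgamma(0.5 * N_px) + math.lgamma(0.5 * (N_sx - N_px))
            - math.lgamma(0.5 * N_sx))
```

On synthetic data with a clear linear trend, a linear basis beats a constant-only basis by a large margin, as the $\chi^2\_{\text{min}}$ term dominates.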

#### *2.3. Bayesian Uncertainty Propagation with Surrogate Models*

Now that we have selected the surrogate model and determined the ingredients of Equation (7), we can determine the pdf for the observables in the light of the experimental data and the simulation results of the training set. The form in Equation (5) allows an easy evaluation of the mean value by using Equation (12) (see also Equation (A3)),

$$\begin{split} \left\langle z^{(x)} \right\rangle &= \iint dV\_a\, dV\_C \; g(a \mid c^{(x)}) \; p\left(C \mid D\_{\text{sim}}, \tilde{\mathcal{I}}\right) p\left(a \mid d\_{\text{exp}}, \mathcal{I}\right) \\ &= \sum\_{\nu} \int dV\_a\, \Phi\_\nu(a) \, \langle C\_{\nu x} \rangle\_a \; p\left(a \mid d\_{\text{exp}}, \mathcal{I}\right) \,. \end{split} \tag{14}$$

Similarly, we obtain (see Equations (12) and (A8))

$$\begin{split} \left\langle z^{(x)} z^{(x')} \right\rangle &= \iint dV\_a\, dV\_C \; g(a \mid c^{(x)})\, g(a \mid c^{(x')}) \; p\left(C \mid D\_{\text{sim}}, \tilde{\mathcal{I}}\right) p\left(a \mid d\_{\text{exp}}, \mathcal{I}\right) \\ &= \sum\_{\nu\nu'} \int dV\_a\, \Phi\_\nu(a)\, \Phi\_{\nu'}(a) \, \langle C\_{\nu x} C\_{\nu' x'} \rangle\_a \; p\left(a \mid d\_{\text{exp}}, \mathcal{I}\right) \\ &= \sum\_{\nu\nu'} \int dV\_a\, \Phi\_\nu(a)\, \Phi\_{\nu'}(a) \left( \langle C\_{\nu x} \rangle\_a \langle C\_{\nu' x'} \rangle\_a + \langle \Delta C\_{\nu x} \Delta C\_{\nu' x'} \rangle\_a \right) p\left(a \mid d\_{\text{exp}}, \mathcal{I}\right) \,. \end{split} \tag{15}$$

The covariance then follows from

$$
\left\langle \Delta z^{(\mathbf{x})} \Delta z^{(\mathbf{x'})} \right\rangle = \left\langle z^{(\mathbf{x})} z^{(\mathbf{x'})} \right\rangle - \left\langle z^{(\mathbf{x})} \right\rangle \left\langle z^{(\mathbf{x'})} \right\rangle. \tag{16}
$$

If we neglected the uncertainty of the surrogate, i.e.,

$$p\left(C \mid D\_{\text{sim}}\right) = \delta\left(C - \hat{C}\right), \qquad \hat{C} = \langle C \rangle\_a \,,$$

then we would recover the widely known special case of 'perfectly trustworthy' surrogates,

$$\left\langle z^{(x)} z^{(x')} \right\rangle = \sum\_{\nu\nu'} \int dV\_a \; \Phi\_\nu(a)\, \Phi\_{\nu'}(a)\, \langle C\_{\nu x} \rangle\_a \langle C\_{\nu' x'} \rangle\_a \; p\left(a \mid d\_{\text{exp}}, \mathcal{I}\right) \,.$$

Thus, the first part in the integral of Equation (15) is the uncertainty of the observable due to experimental uncertainties given the surrogate model, while the second term adds the uncertainty of the surrogate itself. The term $\langle \Delta C\_{\nu x} \Delta C\_{\nu' x'} \rangle\_a$ is commonly neglected, but easily computed. This result also suggests a natural measure for the trustworthiness of the surrogate model, which is directly linked to the specific experiment:

$$\frac{\sum\_{\nu\nu'} \int dV\_a \; \Phi\_\nu(a)\, \Phi\_{\nu'}(a)\, \langle \Delta C\_{\nu x} \Delta C\_{\nu' x} \rangle\_a \; p\left(a \mid d\_{\text{exp}}, \mathcal{I}\right)}{\sum\_{\nu\nu'} \int dV\_a \; \Phi\_\nu(a)\, \Phi\_{\nu'}(a)\, \langle C\_{\nu x} \rangle\_a \langle C\_{\nu' x} \rangle\_a \; p\left(a \mid d\_{\text{exp}}, \mathcal{I}\right)} < \epsilon \,. \tag{17}$$

If the surrogate uncertainties are, on average, smaller than the experimental uncertainties by a few orders of magnitude, e.g., $\epsilon = 10^{-3}$, then they may be neglected. However, $\epsilon$ is the user's choice. Note that this result does not spare the user from solving the foregoing surrogate model selection problem, e.g., by computing the evidence. This work only demonstrates how surrogate uncertainties can be incorporated, and gives a practical rule for when they could be neglected, given that the surrogate model has already been selected beforehand.
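For a single position $x$, Equations (14)–(17) can be evaluated by Monte Carlo over posterior samples of $a$, replacing the integrals by sample averages. A minimal numpy sketch under that assumption; the function `propagate` and its input layout are our naming, and the samples of $p(a \mid d\_{\text{exp}}, \mathcal{I})$ are assumed given:

```python
import numpy as np

def propagate(phi_samples, C_mean, C_cov_nu):
    """Monte Carlo version of Equations (14)-(17) for one position x.
    phi_samples[k, nu] = Phi_nu(a^(k)) for posterior samples a^(k),
    C_mean: (N_p,) posterior mean of the coefficients for this x,
    C_cov_nu: (N_p, N_p) their covariance, Equation (12b).
    Returns mean, total variance (Eq. 16), and the ratio of Eq. (17)."""
    z_samples = phi_samples @ C_mean                  # g(a | <C>) per sample
    z_mean = z_samples.mean()                         # Equation (14)
    # first term of Eq. (15): experimental uncertainty given the surrogate
    var_exp = z_samples.var()
    # second term: surrogate uncertainty, E_a[ Phi^T cov Phi ]
    var_sur = np.einsum('kp,pq,kq->k', phi_samples, C_cov_nu, phi_samples).mean()
    ratio = var_sur / np.mean(z_samples**2)           # Equation (17)
    return z_mean, var_exp + var_sur, ratio
```

Setting `C_cov_nu` to zero reproduces the 'perfectly trustworthy' special case; a non-zero covariance adds the surrogate's own contribution to the total variance.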

#### **3. Numerical Example**

Here, we demonstrate an application where the surrogate uncertainties were in part negligible and in part non-negligible. We apply our method to a computational fluid dynamics simulation of aortic hemodynamics, i.e., blood flow in an aorta, represented by a simplified geometry resembling an upside-down umbrella stick. The simulation depends on a non-Newtonian viscosity model with four parameters $a = \{a\_1, a\_2, a\_3, a\_4\}$. The model was accompanied by viscosity measurements of human blood samples, thus determining $p(a \mid d\_{\text{exp}}, \mathcal{I})$. This posterior turned out to have a complex landscape that cannot reasonably be approximated by standard distributions. In particular, strong correlations and multi-modality were observed, i.e., $p(a \mid d\_{\text{exp}}, \mathcal{I}) \neq \prod\_i p(a\_i \mid d\_{\text{exp}}, \mathcal{I})$. This means that vanilla Polynomial Chaos Expansions could not be applied without an undesirable transformation to conditionally independent variables. The posterior is described in detail in [20]. Based on $p(a \mid d\_{\text{exp}}, \mathcal{I})$, $N\_s = 100$ parameter samples $a\_s$ were chosen and the simulation was evaluated accordingly. The output, $Z\_s$, consisted of the absolute values of the wall shear stress that the blood flow exerts on the aortic wall, for $N\_x = 10$ measurement probes at different locations, each for $N\_t = 101$ time instances equidistantly spaced over one cardiac cycle (ca. 1 s). Further details on the simulation are not relevant here but are documented in [20]. A simulation time on the order of 150 CPU hours per sample suggested the use of a surrogate for the inference. For the surrogate's basis functions, $\Phi\_\nu(a)$, we found multivariate Legendre polynomials of up to order two sufficient.
The numerical integrals were computed with Riemann quadrature, and convergence was checked with successive grid refinement; however, stochastic integration would work just as well. A sketch example of how to implement this procedure computationally efficiently via vectorisation in parameter space can be found at https://github.com/Sranf/BayesianSurrogate\_sketch.git (accessed on 1 January 2021).

In Figure 1, we compare the simulation uncertainty (including surrogate uncertainty) as computed with our Bayesian approach, Equation (15), to the naive estimate of the simulation uncertainty (without surrogate uncertainty, i.e., neglecting $\langle \Delta C\_{\nu x} \Delta C\_{\nu' x'} \rangle\_a$ in Equation (15)). The surrogate uncertainties in the first half (left-hand side) are relatively small, comprising only a few percent of the total uncertainty, and could possibly be neglected. In the second half (right-hand side), however, the surrogate uncertainties make up ∼50% of the total uncertainty. This demonstrates that simulation uncertainties inferred via surrogate models can be severely underestimated if the surrogate uncertainties are neglected, which subsequently leads to overconfidence in the simulation model. In practice, one would acquire more data in order to reduce the surrogate uncertainties; e.g., more data at later time instances in Figure 1 would be particularly promising. This was limited here not only by the computational budget, but also by the impracticality that dynamic simulations require the full evaluation of all previous time instances, where the surrogate is already reasonably accurate. Thus, the procedure of instead explicitly including the surrogate uncertainties has also proven practical here. A similar situation is to be expected for most transient simulations, as uncertainties will usually increase as time progresses.

**Figure 1.** Simulation data (black dots) and simulation uncertainty (1*σ*) according to our Bayesian approach (red, including surrogate uncertainty) as well as the naive simulation uncertainty (blue, neglecting surrogate uncertainty). The black line is the surrogate mean.

#### **4. Discussion**

In this work, we have assumed a Gaussian likelihood for the simulation data, with unknown variance, for a surrogate that is linear in its parameters. Surrogates that are nonlinear in their parameters (e.g., neural networks) may promise higher capacity, however at the expense of losing the analytical tractability of the surrogate uncertainty entirely. Other likelihood functions might be useful if further information is available, such as bounds on the observable (Gamma or Beta likelihood).

The result is a simple formula to incorporate surrogate uncertainties into the simulation uncertainties. This formula will be particularly useful if 'convergence' in the sense of finding the coefficients of, e.g., a Polynomial Chaos Expansion is doubtful or not achievable due to a limited computational budget. The formula immediately suggests an intrinsic measure for the trustworthiness of the surrogate, distinct from commonly used ad hoc diagnostics. This measure is not to be confused with the evidence and should not be used for model selection, because it would not preclude over-fitting. It is merely a measure of the trustworthiness of the already selected surrogate.

Let us now explore the connections of this work to Polynomial Chaos Expansions (PCE) and Gaussian Process Regression (GPR). PCE is a special case of our generalized linear surrogate model, in that the basis functions of the surrogate are chosen such that

$$\int \Phi\_\nu(a)\, \Phi\_{\nu'}(a)\, p(a \mid d\_{\text{exp}}, \mathcal{I})\, dV\_a := \delta\_{\nu,\nu'} \,. \tag{18}$$

The double sum in Equation (15) then contracts to a single sum, and only the diagonal of the term for the surrogate uncertainty, $\langle \Delta C\_{\nu x} \Delta C\_{\nu x'} \rangle\_a$, survives. This is expected, in that PCE is defined such that the basis functions are uncorrelated, but the expansion coefficients are still uncertain to a finite degree, and this must carry over to the simulation uncertainty. A severe limitation of PCE is that it is rather difficult to find basis function sets $\{\Phi\_\nu\}$ that fulfill Equation (18), depending on $p(a \mid d\_{\text{exp}}, \mathcal{I})$. For most practical purposes, one demands (i) conditional independence of the simulation parameters, i.e., $p(a \mid d\_{\text{exp}}, \mathcal{I}) = \prod\_i p(a\_i \mid d\_{\text{exp}}, \mathcal{I})$, as well as (ii) simple standard distributions for $p(a\_i \mid d\_{\text{exp}}, \mathcal{I})$, in order to find a solution (usually a tensor product) to Equation (18). Known, albeit tedious, workarounds are variable transformations and numerical orthonormalisation [18] for (i), and PCE constructions for arbitrary pdfs [17] for (ii). Note that [17] also demands (i), conditional independence of the simulation parameters. Our approach is not afflicted by the above considerations, unless Equation (18) is specifically demanded. In the numerical example above, neither (i) nor (ii) was applicable. Finding a variable transformation in order to fulfill (i), or the numerical construction of orthonormal basis functions, can be difficult, and is particularly inconvenient if sophisticated priors $p(a \mid \mathcal{I})$ are being used, e.g., Jeffreys' generalized prior. An interesting alternative would be to model the input dependencies with vine copulas [21] in order to overcome the limitations of PCE addressed here. Unfortunately, no obvious vine copula was found for the example presented here.
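The difficulty with Equation (18) can be made concrete in one dimension: normalized Legendre polynomials are orthonormal under the uniform density, but already a mildly tilted density breaks the orthonormality. A small numpy check; the two densities below are illustrative choices of ours:

```python
import numpy as np
from numpy.polynomial import legendre

# Gauss-Legendre nodes/weights on [-1, 1]
nodes, weights = legendre.leggauss(10)

def gram(density):
    """Gram matrix G_{nu nu'} = int Phi_nu Phi_mu density(a) da for the first
    three Legendre polynomials, normalized for the uniform density p(a) = 1/2."""
    G = np.zeros((3, 3))
    for nu in range(3):
        for mu in range(3):
            c_nu = np.zeros(nu + 1); c_nu[nu] = np.sqrt(2 * nu + 1)
            c_mu = np.zeros(mu + 1); c_mu[mu] = np.sqrt(2 * mu + 1)
            f = legendre.legval(nodes, c_nu) * legendre.legval(nodes, c_mu)
            G[nu, mu] = np.sum(weights * f * density(nodes))
    return G

uniform = lambda a: np.full_like(a, 0.5)    # Equation (18) holds: G = identity
tilted = lambda a: 0.5 + 0.375 * a          # valid density on [-1, 1], G != identity
```

Under `uniform`, the Gram matrix is the identity; under `tilted`, off-diagonal entries appear, so a new orthonormal basis would have to be constructed for that measure.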

Gaussian Process Regression would correspond to a change in the prior for $z^{(x)}$ in Equation (6) as follows:

$$p\left(z^{(x)} \mid C, a, \theta, \tilde{\mathcal{I}}\right) = \mathcal{N}\left(g\left(a \mid c^{(x)}\right),\, K(\theta)\right),\tag{19}$$

where N denotes a normal distribution, and *K* is the prior's covariance matrix, defined by the parametrized covariance function *k* via [*K*]*ij* = *k*(*a*(*i*), *a*(*j*) | *θ*). This in turn changes (*Zs* − *MsC*)*T*(*Zs* − *MsC*) → (*Zs* − *MsC*)*T* *K*−1(*Zs* − *MsC*) in Equation (9). By again completing the square and following the same procedure, the corresponding results for the mean, Equation (12a), the variance, Equation (12b), and the evidence, Equation (13), are retained by the simple substitutions

$$\begin{aligned} H\_{s} &\rightarrow \tilde{H}\_{s}\,, & \qquad \tilde{H}\_{s} &= M\_{s}^{T} K\_{s}^{-1} M\_{s}\,, \\ \chi^{2}\_{\text{min}} &\rightarrow \tilde{\chi}^{2}\_{\text{min}}\,, & \qquad \tilde{\chi}^{2}\_{\text{min}} &= \text{tr}\left\{ Z\_{s}^{T} \left( K\_{s}^{-1} - K\_{s}^{-1} M\_{s} \tilde{H}\_{s}^{-1} M\_{s}^{T} K\_{s}^{-1} \right) Z\_{s} \right\} , \end{aligned}$$

where [*Ks*]*ij* = *k*(*a*(*i*)*s*, *a*(*j*)*s* | *θ*) is the likelihood's covariance matrix evaluated at *As* for the data set *Zs* at given *θ*. Equations (14) and (15) would preserve their form with the substitutions

$$\begin{aligned} \left\langle \Phi\_{\nu}(a)\, C \right\rangle\_{a} \;&\rightarrow\; \left\langle \Phi\_{\nu}(a)\, C \right\rangle\_{a,\theta} + \left\langle K\_{\*}^{T} K\_{s}^{-1} M\_{s}\, C \right\rangle\_{a,\theta}\,, \\ \left\langle \Delta C\_{\nu x} \Delta C\_{\nu x} \right\rangle\_{a} \;&\rightarrow\; \left\langle \Delta C\_{\nu x} \Delta C\_{\nu x} \right\rangle\_{a,\theta} + \left( K - K\_{\*}^{T} K\_{s}^{-1} K\_{\*} \right), \end{aligned}$$

where the subscript *θ* acknowledges that the right-hand side now depends on *θ*, and [*K*∗]*ij* = *k*(*a*(*i*), *a*(*j*)*s* | *θ*) is the covariance between the training set *As* and the 'test set', i.e., the integration variable *a*. Note that the additionally introduced hyperparameters *θ* would require the choice of a prior for *θ* and marginalization with respect to *θ* in Equation (7), and subsequently also in Equations (12)–(15).
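As a minimal numerical sketch of the substituted quantities in the Gaussian-process case, the following computes *H*˜*s* = *MsT Ks*−1*Ms* and *χ*˜2min for a tiny illustrative data set. The squared-exponential kernel, the jitter term, the data, and all sizes are our own assumptions, not prescribed by the paper.

```python
import math

A_s = [0.0, 0.5, 1.0]                      # N_s = 3 training inputs (assumed)
Z_s = [[1.0], [1.4], [2.1]]                # data at one spatial site, N_x = 1
Phi = [lambda a: 1.0, lambda a: a]         # basis functions {Phi_nu}, N_p = 2
M_s = [[p(a) for p in Phi] for a in A_s]   # design matrix M_s

def k_se(a, b, theta=0.7):                 # assumed covariance function k(a, b | theta)
    return math.exp(-0.5 * (a - b) ** 2 / theta ** 2)

K_s = [[k_se(a, b) + (1e-8 if i == j else 0.0) for j, b in enumerate(A_s)]
       for i, a in enumerate(A_s)]         # likelihood covariance with tiny jitter

def mm(X, Y):                              # dense matrix product (sketch only)
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def inv(X):                                # Gauss-Jordan inverse, no pivoting (sketch)
    n = len(X)
    A = [row[:] + [float(i == j) for j in range(n)] for i, row in enumerate(X)]
    for i in range(n):
        piv = A[i][i]
        A[i] = [v / piv for v in A[i]]
        for r in range(n):
            if r != i:
                f = A[r][i]
                A[r] = [v - f * w for v, w in zip(A[r], A[i])]
    return [row[n:] for row in A]

Mt = [list(r) for r in zip(*M_s)]
Kinv = inv(K_s)
H_t = mm(Mt, mm(Kinv, M_s))                # H~_s = M_s^T K_s^-1 M_s
# chi2~_min = tr{ Z_s^T (K_s^-1 - K_s^-1 M_s H~_s^-1 M_s^T K_s^-1) Z_s }
KM = mm(Kinv, M_s)
P = mm(KM, mm(inv(H_t), [list(r) for r in zip(*KM)]))
B = [[Kinv[i][j] - P[i][j] for j in range(len(A_s))] for i in range(len(A_s))]
chi2_t = mm([list(r) for r in zip(*Z_s)], mm(B, Z_s))[0][0]  # N_x = 1: trace = element
```

Since *H*˜*s* must be symmetric and *χ*˜2min is a generalized-least-squares residual, both are easy sanity checks on the substitution.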

We now discuss the implications of <sup>I</sup>˜ in contrast to the original background information I. Most importantly, I states that the observable *z* is uniquely determined by the simulation for a given set of input parameters *a*. A prerequisite here is that the simulation is converged; for finite element simulations, for example, this would be a mesh-converged spatial discretization. The proposition <sup>I</sup>˜ additionally assumes Equations (3) and (6), so that Equation (2) can be replaced by Equation (4). Formally, to get from Equation (2) to Equation (4), we replace *p*(*z* | *a*, I) → *p*(*z* | *a*, *c*, <sup>I</sup>˜), where *p*(*z* | *a*, I) = *δ*(*z* − *z*(*a*)) and *p*(*z* | *a*, *c*, <sup>I</sup>˜) = *δ*(*z* − *zsur*(*a*)) = *δ*(*z* − *g*(*a* | *c*)). In other words, <sup>I</sup>˜ contains, in comparison to I, the additional assumption that we can use the value for *z* as predicted by the surrogate model. It also means that we introduce additional, artificial, and usually unknown regression parameters *c* that need to be marginalized over. The additional uncertainty introduced by this approximation (i.e., the surrogate assumption) is encoded in *p*(*c* | *D*sim, <sup>I</sup>˜), and is correctly incorporated into the simulation observable uncertainties in Equation (15).
What is important here is that *p*(*z* | *d*exp, *D*sim, <sup>I</sup>˜) ≠ *p*(*z* | *d*exp, I) in general (the latter, which does not condition on *D*sim, is computationally infeasible), but *p*(*z* | *d*exp, *D*sim, <sup>I</sup>˜) ≈ *p*(*z* | *d*exp, *D*sim, I) if Equation (3) holds and *p*(*C* | *D*sim) ≈ *δ*(*C* − *C*ˆ), i.e., if the surrogate is indeed a good approximation and the posterior for the surrogate parameters is sharply peaked at *C*ˆ. Very often, this posterior is not sharply peaked; then, we can gather more data until it is, or, if that is not possible, we can at least avoid the overconfidence induced by neglecting these uncertainties. A numerical example of where this is the case was demonstrated above.

We have modeled spatial correlations by introducing a location index *x*, and assumed that the expansion coefficients at different sites, *c*(*x*) and *c*(*x*′), are conditionally independent. This assumption is reasonable, in that the expansion coefficients are arbitrary mathematical constructs and no physically motivated model for their correlation is known. The spatial correlation, however, is retained in *z*(*x*), as was originally intended. More general models for spatial correlations can easily be implemented by substituting a spatial covariance matrix for *δxx*′ in Equation (12b). Note that this would require an additional marginalization with respect to the (typically nonlinear) hyperparameters of the spatial covariance matrix. By introducing a compound index *x*˜ = (*x*, *t*) and substituting *x* → *x*˜, we find a simple generalization to spatio-temporal correlations. This is equivalent to re-ordering spatial and temporal indices into a single sequence. While this procedure is convenient and requires only minor changes in the numerical implementation, it implicitly assumes conditional independence of spatial and temporal correlations. Analogously, general temporal correlations can be modeled by substituting a temporal covariance matrix for *δx*˜*x*˜′ in Equation (12b), again requiring an additional marginalization with respect to the latter's hyperparameters.
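The compound-index construction above can be sketched as a Kronecker product: the block matrix with entries [*Hs*]*νν*′ *δxx*′ is *Hs* ⊗ 1. A minimal illustration with assumed sizes (*Np* = 2, *Nx* = 2); replacing the identity block by a spatial covariance matrix yields the more general model:

```python
def kron(A, B):
    """Kronecker product of two dense matrices (lists of lists)."""
    nb, mb = len(B), len(B[0])
    return [[A[i // nb][j // mb] * B[i % nb][j % mb]
             for j in range(len(A[0]) * mb)]
            for i in range(len(A) * nb)]

H_s = [[2.0, 0.3], [0.3, 1.0]]   # N_p x N_p block (illustrative numbers)
I_x = [[1.0, 0.0], [0.0, 1.0]]   # delta_xx' for N_x = 2 sites
H = kron(H_s, I_x)               # compound index l = (nu, x), N_px x N_px
# spatio-temporal case: x~ = (x, t) simply enlarges the identity block to
# N_x * N_t; a nontrivial spatial covariance S replaces I_x via kron(H_s, S).
```

The re-ordering of spatial and temporal indices into a single sequence corresponds to nothing more than this flattening of the block structure.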

#### **5. Conclusions**

We presented a Bayesian analysis of surrogate models and the associated uncertainty propagation problem in the context of uncertainty quantification of computer simulations. The assumptions were a generalized linear surrogate model (linear in its parameters, not in the variable) and a Gaussian likelihood with unknown variance. Additionally, spatial and temporal correlations were discussed. One result is a measure of the trustworthiness of the surrogate, quantified by the ratio of the surrogate uncertainty to the total uncertainty, in contrast to commonly used heuristic diagnostics. The main result, however, is a rather simple rule for including surrogate uncertainties in the sought-for uncertainties of the simulation output. This is particularly useful for problems where the surrogate's trustworthiness is doubtful and cannot be improved. The connections to Polynomial Chaos Expansions and Gaussian Process Regression were discussed. A numerical example demonstrated that simulation uncertainties can be significantly underestimated if surrogate uncertainties are neglected.

**Funding:** This work was funded by Graz University of Technology (TUG) through the LEAD Project "Mechanics, Modeling, and Simulation of Aortic Dissection" (biomechaorta.tugraz.at) and supported by GCCE: Graz Center of Computational Engineering.

**Data Availability Statement:** All information is contained in the manuscript. Code sketches are available at https://github.com/Sranf/BayesianSurrogate\_sketch.git (1 January 2021).

**Acknowledgments:** The authors are grateful for useful comments from Ali Mohammad-Djafari.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Appendix A. Mathematical Proofs**

Here, we want to determine the norm, mean, and covariance of the marginalized Gaussian (a Student-t distribution) in Equation (11), which is

$$\begin{split} p\left(C \mid Z\_{s}, A\_{s}\right) &= \frac{1}{Z} \left(\chi^{2}\right)^{-\frac{N\_{sx}}{2}},\\ \chi^{2} &= \text{tr}\left\{\left(Z\_{s} - M\_{s}C\right)^{T}\left(Z\_{s} - M\_{s}C\right)\right\}. \end{split} \tag{A1}$$

In order to perform the integration, we first complete the square to obtain a quadratic form in *C*, which can then be integrated analytically, i.e., we bring the misfit *χ*2 into a form that elucidates the *C*-dependence:

$$\begin{aligned} \boldsymbol{\chi}^2 &= \boldsymbol{\chi}\_{\text{min}}^2 + \text{tr}\left\{ \left( \mathbf{C} - \hat{\mathbf{C}} \right)^T \boldsymbol{H}\_\text{s} \left( \mathbf{C} - \hat{\mathbf{C}} \right) \right\}, & \boldsymbol{H}\_\text{s} &= \boldsymbol{M}\_\text{s}^T \boldsymbol{M}\_\text{s} \; , \\ \boldsymbol{\chi}^2\_{\text{min}} &= \text{tr}\left\{ \mathbf{Z}\_\text{s}^T \left( \mathbf{1} - \boldsymbol{M}\_\text{s} \boldsymbol{H}\_\text{s}^{-1} \boldsymbol{M}\_\text{s}^T \right) \mathbf{Z}\_\text{s} \right\}, & \boldsymbol{\hat{\mathbf{C}}} &= \boldsymbol{H}\_\text{s}^{-1} \boldsymbol{M}\_\text{s}^T \mathbf{Z}\_\text{s} \; , \end{aligned}$$
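The completed square can be checked numerically. The following minimal sketch with illustrative random data (*Ns* = 5, *Np* = 2, *Nx* = 1; sizes and data are our own choices) verifies the decomposition *χ*2(*C*) = *χ*2min + tr{(*C* − *C*ˆ)*T Hs* (*C* − *C*ˆ)} at an arbitrary *C*:

```python
import random

random.seed(0)
M_s = [[1.0, random.random()] for _ in range(5)]    # design matrix (illustrative)
Z_s = [[random.gauss(0.0, 1.0)] for _ in range(5)]  # data, one spatial site

def mm(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

Mt = [list(r) for r in zip(*M_s)]
H_s = mm(Mt, M_s)                                   # H_s = M_s^T M_s
det = H_s[0][0] * H_s[1][1] - H_s[0][1] * H_s[1][0]
H_inv = [[H_s[1][1] / det, -H_s[0][1] / det],       # explicit 2x2 inverse
         [-H_s[1][0] / det, H_s[0][0] / det]]
C_hat = mm(H_inv, mm(Mt, Z_s))                      # C^ = H_s^-1 M_s^T Z_s

def chi2(C):                                        # tr{(Z - MC)^T (Z - MC)}
    return sum((Z_s[i][0] - sum(M_s[i][j] * C[j][0] for j in range(2))) ** 2
               for i in range(5))

chi2_min = chi2(C_hat)
C = [[0.3], [-1.2]]                                 # arbitrary test point
D = [[C[i][0] - C_hat[i][0]] for i in range(2)]
quad = mm([list(r) for r in zip(*D)], mm(H_s, D))[0][0]
```

The identity holds exactly because the cross term *MsT*(*Zs* − *MsC*ˆ) vanishes at the least-squares solution.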

Now, the first moment is easily obtained. Along with the variable transformation under the integral

$$
C \to \hat{C} + X \tag{A2}
$$

we obtain

$$\begin{split} \langle C \rangle &= \frac{1}{Z} \int dV\_{C} \, C \left( \text{tr}\left\{ (C - \hat{C})^{T} H\_{s} (C - \hat{C}) \right\} + \chi\_{\text{min}}^{2} \right)^{-\frac{N\_{sx}}{2}} \\ &= \hat{C} + \frac{1}{Z} \underbrace{\int dV\_{X} \, X \left( \text{tr}\left\{ X^{T} H\_{s} X \right\} + \chi\_{\text{min}}^{2} \right)^{-\frac{N\_{sx}}{2}}}\_{=0}\,, \end{split} \tag{A3}$$

where we have used the symmetry properties of the likelihood. Next, we transform the expression for normalization based on Equation (A2)

$$Z\_{N\_{sx}} = \int dV\_{X} \left( \text{tr}\left\{ X^{T} H\_{s} X \right\} + \chi^{2}\_{\text{min}} \right)^{-\frac{N\_{sx}}{2}}.$$

Now, we combine and reorder the double indices (*ν*, *x*) into a single index *l*, which turns the matrix *X* of dimension *Np* × *Nx* into a vector *x* of dimension *Npx* = *Np* · *Nx*, and the matrix *Hs* of dimension *Np* × *Np* into a new block matrix *H* of dimension *Npx* × *Npx* such that

$$[H]\_{ll'} = [H\_{\mathfrak{s}}]\_{\nu,\nu'} \,\delta\_{\mathbf{x}\mathbf{x}'} \,. \tag{A4}$$

In this representation, we have

$$Z\_{N\_{sx}} = \int dV\_{x} \left(x^{T} H x + \chi\_{\text{min}}^{2}\right)^{-\frac{N\_{sx}}{2}} = |H|^{-\frac{1}{2}} \int dV\_{y} \left(y^{T} y + \chi\_{\text{min}}^{2}\right)^{-\frac{N\_{sx}}{2}},\tag{A5}$$

where we substituted *<sup>x</sup>* <sup>→</sup> *<sup>H</sup>*<sup>−</sup> <sup>1</sup> <sup>2</sup> *y*. Next, we introduce hyper-spherical coordinates, which leads to

$$Z\_{N\_{sx}} = \Omega\_{N\_{px}} \left| H \right|^{-\frac{1}{2}} \int\_{0}^{\infty} \frac{d\rho}{\rho} \,\rho^{N\_{px}} \left(\rho^{2} + \chi\_{\text{min}}^{2}\right)^{-\frac{N\_{sx}}{2}},$$

where Ω*Npx* is the solid angle in *Npx* dimensions. Finally, with the substitution *ρ*2 = *t* · *χ*2min, we recover an identity of the Beta function and obtain

$$Z\_{N\_{sx}} = \Omega\_{N\_{px}} \left| H \right|^{-\frac{1}{2}} \left( \chi^{2}\_{\text{min}} \right)^{-\frac{N\_{sx} - N\_{px}}{2}} \frac{\Gamma(\frac{N\_{px}}{2}) \Gamma(\frac{N\_{sx} - N\_{px}}{2})}{\Gamma(\frac{N\_{sx}}{2})} . \tag{A6}$$

This result is valid only for *Nsx* > *Npx*, which is fulfilled in the present application. For future use, we rewrite this as

$$Z\_{N\_{sx}} = Z\_{(N\_{sx} - 2)} \cdot \left(\chi\_{\text{min}}^{2}\right)^{-1} \cdot \frac{N\_{sx} - N\_{px} - 2}{N\_{sx} - 2} \,. \tag{A7}$$
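The recursion can be verified directly against the closed form of Equation (A6). In the ratio *ZNsx*/*ZNsx−2*, the solid-angle factor Ω*Npx* and |*H*|−1/2 depend only on *Npx* and *H* and cancel, so they are omitted in this small numerical check (the numbers are illustrative):

```python
import math

def Z_closed(N_sx, N_px, chi2_min):
    """Equation (A6) without the Omega and |H| factors, which cancel
    in the ratio tested below."""
    return (chi2_min ** (-(N_sx - N_px) / 2)
            * math.gamma(N_px / 2) * math.gamma((N_sx - N_px) / 2)
            / math.gamma(N_sx / 2))

N_sx, N_px, chi2_min = 12, 3, 2.5          # illustrative values, N_sx > N_px
lhs = Z_closed(N_sx, N_px, chi2_min)
rhs = (Z_closed(N_sx - 2, N_px, chi2_min) / chi2_min
       * (N_sx - N_px - 2) / (N_sx - 2))   # right-hand side of Equation (A7)
```

The agreement follows from the Gamma-function recurrence Γ(*z* + 1) = *z* Γ(*z*) applied to both Γ((*Nsx* − *Npx*)/2) and Γ(*Nsx*/2).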

Finally, we calculate the covariance, based also on the compound index *l* = (*ν*, *x*), and by using the variable transformation in Equation (A2):

$$
\begin{split}
\langle\Delta C\_{l}\Delta C\_{l'}\rangle &= \frac{1}{Z\_{N\_{sx}}} \int dV\_{x} \, x\_{l} x\_{l'} \left(x^{T} H x + \chi^{2}\_{\text{min}}\right)^{-\frac{N\_{sx}}{2}} \\
&= -\frac{2}{N\_{sx}-2} \cdot \frac{1}{Z\_{N\_{sx}}} \cdot \frac{\partial}{\partial H\_{ll'}} \int dV\_{x} \left(x^{T} H x + \chi^{2}\_{\text{min}}\right)^{-\frac{N\_{sx}}{2}+1} \\
&= -\frac{2}{N\_{sx}-2} \cdot \frac{\chi^{2}\_{\text{min}} \cdot (N\_{sx}-2)}{Z\_{(N\_{sx}-2)} \cdot (N\_{sx}-N\_{px}-2)} \cdot \frac{\partial}{\partial H\_{ll'}} Z\_{(N\_{sx}-2)} \\
&= -\frac{2\chi^{2}\_{\text{min}}}{N\_{sx}-N\_{px}-2} \, \frac{\partial}{\partial H\_{ll'}} \ln\left(Z\_{(N\_{sx}-2)}\right) \\
&= -\frac{2\chi^{2}\_{\text{min}}}{N\_{sx}-N\_{px}-2} \, \underbrace{\frac{\partial}{\partial H\_{ll'}} \ln\left(|H|^{-\frac{1}{2}}\right)}\_{=-\frac{1}{2}\left[H^{-1}\right]\_{ll'}} = \frac{\chi^{2}\_{\text{min}}}{N\_{sx}-N\_{px}-2} \left[H^{-1}\right]\_{ll'}\,.
\end{split} \tag{A8}
$$

In the last step, we have used that *H* is a symmetric matrix. This is a very reasonable result because, if the variance Δ2 in the Gaussian in Equation (10) were known, the covariance would be Δ2*H*−1. Consequently, the prefactor represents the Bayesian estimate for the variance Δ2 based on the data. Now, we return to the original meaning of the compound index, Equation (A4), i.e., [*H*−1]*ll*′ → [*H*−1*s*]*νν*′ *δxx*′, and obtain the final result.

#### **Appendix B. The Transformation Invariant Prior for the Surrogate Coefficients**

Bayesian probability theory allows us to rigorously and consistently incorporate any prior knowledge we have about the experiment before looking at the data. This knowledge shall be elicited here. Our inference must not depend on the exact parametrization: e.g., if we re-parametrize the surrogate, or re-label or re-order the surrogate parameters, the surrogate should still describe the same simulation. This is reasonable because the surrogate is a purely mathematical, auxiliary construct. This reparametrization invariance is ensured by Jeffreys' generalized prior, built from the determinant of the Riemannian metric *R* (the Fisher information matrix) [16]:

$$p(C) = \frac{1}{Z} \Big|\det(R)\Big|^{1/2} \quad \text{with} \quad R\_{ij} = -\int p\left(Z\_{s} \mid C\right) \frac{\partial^{2}}{\partial C\_{i} \partial C\_{j}} \ln\left(p\left(Z\_{s} \mid C\right)\right) \,dV\_{Z\_{s}} \,, \tag{A9}$$

with multi-indices *i*, *j* = (*ν*, *x*). With the likelihood and the generalized surrogate model defined in the manuscript, the result is

$$\begin{split} R\_{ij} &\propto \sum\_{k=1}^{N\_{s}} \frac{\partial g(a\_{s}^{(k)} \mid C)}{\partial C\_{i}} \frac{\partial g(a\_{s}^{(k)} \mid C)}{\partial C\_{j}} \\ &= \sum\_{k=1}^{N\_{s}} \Phi\_{i}(a\_{s}^{(k)}) \, \Phi\_{j}(a\_{s}^{(k)}) = \text{const.} \end{split} \tag{A10}$$

This prior is independent of *C*, i.e., a constant.
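The constancy of *Rij* for a linear-in-parameters surrogate can also be checked numerically. A minimal sketch with an illustrative basis and sample points (all names and values are ours): the Fisher-information-like matrix, built from numerical derivatives of *g*(*a* | *C*) with respect to *C*, is evaluated at two different parameter values and compared.

```python
# check that R_ij = sum_k dg/dC_i * dg/dC_j does not depend on C for
# g(a | C) = sum_nu C_nu Phi_nu(a); basis and sample points are illustrative
Phi = [lambda a: 1.0, lambda a: a, lambda a: a * a]
A_s = [0.1, 0.4, 0.9, 1.3]

def g(a, C):
    return sum(c * p(a) for c, p in zip(C, Phi))

def fisher(C, eps=1e-6):
    def dg(a, i):  # central-difference derivative of g with respect to C_i
        Cp = list(C); Cp[i] += eps
        Cm = list(C); Cm[i] -= eps
        return (g(a, Cp) - g(a, Cm)) / (2 * eps)
    n = len(C)
    return [[sum(dg(a, i) * dg(a, j) for a in A_s) for j in range(n)]
            for i in range(n)]

R1 = fisher([0.0, 0.0, 0.0])
R2 = fisher([5.0, -2.0, 7.0])
```

For a model linear in *C*, the derivatives reduce to the basis functions themselves, so *R* is the same at every point in parameter space, which is exactly why the Jeffreys prior is constant here.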

#### **References**

