*2.2. Gaussian Process Regression with Noise*

Gaussian process regression (GPR) [23] is a non-parametric prediction model based on a Gaussian prior distribution. Its two main advantages are the interpretable relationship between predictions and observations, and the probabilistic treatment obtained when prior models are embedded. Over the past decades, theoretical research and real-world applications have shown GPR to be a powerful tool for supervised learning [24]. Given a dataset D = {*X*, *y*}, where *X* ∈ R<sup>*n*×*m*</sup>, *y* ∈ R<sup>*n*×1</sup>, *n* is the sample size, and *m* is the sample dimension, assume the regression function *f* mapping an input vector *x* to an output value *y* can be written as:

$$y = f(\mathbf{x}) + \epsilon \tag{12}$$

where ε is Gaussian noise with distribution N(0, σ<sub>*n*</sub><sup>2</sup>), and the "signal" term *f*(*x*) and the noise ε are mutually independent. The signal term *f*(*x*) is also assumed to be a random variable, distributed as a Gaussian process:

$$f(\mathbf{x}) \sim \mathcal{GP}(m(\mathbf{x}), k(\mathbf{x}, \mathbf{x}')) \tag{13}$$

where *m*(*x*) = E[*f*(*x*)] is the mean function, which is often set to 0, and *k*(*x*, *x*′) = E[(*f*(*x*) − *m*(*x*))(*f*(*x*′) − *m*(*x*′))] is the covariance function, which encodes prior assumptions such as the likely smoothness and patterns of the data. The covariance function *k* is also known as the *kernel function* of the Gaussian process [25].
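As a minimal illustration of a kernel, the following Python sketch (not part of the original formulation; the kernel choice and hyperparameter values are arbitrary assumptions) evaluates a squared exponential covariance over a small set of 1-D inputs and checks that the resulting matrix is a valid covariance: symmetric and positive definite.

```python
import numpy as np

def sq_exp_kernel(x1, x2, sigma_f=1.0, length=1.0):
    """Squared exponential covariance k(x, x') = sigma_f^2 exp(-(x - x')^2 / (2 l^2)).

    sigma_f and length are illustrative hyperparameter choices.
    """
    d = x1[:, None] - x2[None, :]  # pairwise differences
    return sigma_f**2 * np.exp(-0.5 * (d / length) ** 2)

X = np.array([0.0, 0.5, 1.0])
K = sq_exp_kernel(X, X)
# K is symmetric, has sigma_f^2 = 1 on the diagonal, and is positive definite.
```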

Given the collected dataset D = {*X*, *y*}, a predicted signal function *f*∗ should be constructed in order to forecast a new output *y*∗ from a new input *x*∗. Once the mean function and the kernel have been determined, the predicted function *f*∗ can be sampled as follows.

$$f\_\* \sim \mathcal{N}(\mathbf{0}, k(\mathbf{x}\_\*, \mathbf{x}\_\*)) \tag{14}$$
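Equation (14) can be visualized by drawing sample functions from the zero-mean prior over a grid of test inputs. The sketch below is illustrative only: the squared exponential kernel, its hyperparameters, and the jitter term are assumptions, not values from this paper.

```python
import numpy as np

def kern(A, B, sigma_f=1.0, l=1.0):
    # illustrative squared exponential kernel
    d = A[:, None] - B[None, :]
    return sigma_f**2 * np.exp(-0.5 * (d / l) ** 2)

rng = np.random.default_rng(0)
x_star = np.linspace(0.0, 5.0, 100)

# prior covariance K(x_*, x_*), with a small jitter for numerical stability
K_ss = kern(x_star, x_star) + 1e-10 * np.eye(100)

# each row of `samples` is one function drawn from the prior of Eq. (14)
samples = rng.multivariate_normal(np.zeros(100), K_ss, size=3)
```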

Then, the joint probabilistic distribution of the training outputs *y* and the predicted function *f*<sup>∗</sup> can be written as:

$$
\begin{bmatrix}
\mathbf{y} \\
\mathbf{f}\_{\star}
\end{bmatrix} \sim \mathcal{N}\left(\mathbf{0}, \begin{bmatrix}
K(\mathbf{X}, \mathbf{X}) + \sigma\_n^2 \mathbf{I}\_n & K(\mathbf{X}, \mathbf{x}\_{\star}) \\
K(\mathbf{x}\_{\star}, \mathbf{X}) & k(\mathbf{x}\_{\star}, \mathbf{x}\_{\star})
\end{bmatrix}\right) \tag{15}
$$

where *K*(*X*, *X*) denotes the covariance matrix between all training inputs, *K*(*X*, *x*∗) the covariance matrix between the training and test inputs, *K*(*x*∗, *X*) the covariance matrix between the test and training inputs, and *k*(*x*∗, *x*∗) the covariance between the test inputs. *I*<sub>*n*</sub> is the *n* × *n* identity matrix and σ<sub>*n*</sub><sup>2</sup> is the assumed noise variance of the training samples.
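The block structure of Equation (15) can be assembled directly. The sketch below is an illustration under assumed hyperparameters (kernel choice, noise level, and toy inputs are all hypothetical): it builds the joint covariance of *y* and *f*∗ from the four blocks K(X, X) + σ<sub>*n*</sub><sup>2</sup>I<sub>*n*</sub>, K(X, x∗), K(x∗, X), and k(x∗, x∗).

```python
import numpy as np

def kern(A, B, sigma_f=1.0, l=1.0):
    # illustrative squared exponential kernel
    d = A[:, None] - B[None, :]
    return sigma_f**2 * np.exp(-0.5 * (d / l) ** 2)

sigma_n = 0.1                    # assumed noise standard deviation
X = np.linspace(0.0, 1.0, 4)     # toy training inputs (n = 4)
x_star = np.array([0.5, 1.5])    # toy test inputs (n_* = 2)

# joint covariance of [y, f_*] as in Eq. (15)
joint_cov = np.block([
    [kern(X, X) + sigma_n**2 * np.eye(len(X)), kern(X, x_star)],
    [kern(x_star, X),                          kern(x_star, x_star)],
])
# joint_cov is (n + n_*) x (n + n_*), symmetric positive semi-definite
```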

The main task of GPR is to forecast the most likely value of *y*∗ associated with *x*∗. By Bayes' rule, the conditional distribution is obtained [23] as:

$$p\left(f\_\* \mid \mathbf{x}\_\*, \mathbf{X}, \boldsymbol{y}\right) \sim \mathcal{N}(m\_\*, \text{Cov}(f\_\*)) \tag{16}$$

$$m\_{\ast} = K(\mathbf{x}\_{\ast}, \mathbf{X}) \left[ K(\mathbf{X}, \mathbf{X}) + \sigma\_n^2 \mathbf{I}\_n \right]^{-1} \mathbf{y} \tag{17}$$

$$\text{Cov}(f\_{\star}) = k(\mathbf{x}\_{\star}, \mathbf{x}\_{\star}) - K(\mathbf{x}\_{\star}, \mathbf{X}) \left[ K(\mathbf{X}, \mathbf{X}) + \sigma\_n^2 \mathbf{I}\_n \right]^{-1} K(\mathbf{X}, \mathbf{x}\_{\star}) \tag{18}$$
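Equations (17) and (18) translate directly into code. The following sketch is not the paper's implementation: the kernel, hyperparameters, and toy data are assumptions. It computes the posterior mean and covariance by solving linear systems against K(X, X) + σ<sub>*n*</sub><sup>2</sup>I<sub>*n*</sub> rather than forming the inverse explicitly, which is the numerically preferred route.

```python
import numpy as np

def gpr_predict(X, y, X_star, kernel, sigma_n):
    """Posterior mean (Eq. 17) and covariance (Eq. 18) of f_* given (X, y).

    `kernel(A, B)` must return the covariance matrix K(A, B); all names
    here are illustrative.
    """
    K = kernel(X, X) + sigma_n**2 * np.eye(len(X))  # K(X, X) + sigma_n^2 I_n
    K_s = kernel(X_star, X)                         # K(x_*, X)
    alpha = np.linalg.solve(K, y)                   # [K + sigma_n^2 I]^{-1} y
    mean = K_s @ alpha                              # Eq. (17)
    cov = kernel(X_star, X_star) - K_s @ np.linalg.solve(K, K_s.T)  # Eq. (18)
    return mean, cov

def sq_exp(A, B, sigma_f=1.0, l=1.0):
    # illustrative squared exponential kernel
    d = A[:, None] - B[None, :]
    return sigma_f**2 * np.exp(-0.5 * (d / l) ** 2)

# toy data: noisy sine observations, then a prediction at a new input
rng = np.random.default_rng(1)
X = np.linspace(0.0, 5.0, 20)
y = np.sin(X) + rng.normal(0.0, 0.1, X.shape)
X_star = np.array([2.5])
mean, cov = gpr_predict(X, y, X_star, sq_exp, sigma_n=0.1)
```

With training inputs covering the test point, the posterior mean tracks the underlying signal and the posterior variance shrinks well below the prior variance *k*(*x*∗, *x*∗).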

Based on this theoretical analysis, the mean and covariance functions are the two most important elements of GPR. The kernel function *k* directly encodes prior knowledge about the function *f*, and a combination of two kernel functions is still a valid kernel [25]. In this paper, we use a composite covariance function: the squared exponential kernel *k*<sub>1</sub>(*x*, *x*′) in Equation (19) expresses the smooth trend of the data, and the exponential kernel *k*<sub>2</sub>(*x*, *x*′) in Equation (20) captures its irregularity.

$$k\_1(\mathbf{x}, \mathbf{x}') = \sigma\_{f\_1}^2 \exp\left(-\frac{1}{2}(\mathbf{x} - \mathbf{x}')^T\begin{bmatrix} l\_{s1}^2 & & \\ & \ddots & \\ & & l\_{sp}^2 \end{bmatrix}^{-1} (\mathbf{x} - \mathbf{x}')\right) \tag{19}$$

$$k\_2(\mathbf{x}, \mathbf{x}') = \sigma\_{f\_2}^2 \exp\left(-\sqrt{(\mathbf{x} - \mathbf{x}')^T\begin{bmatrix} l\_{c1}^2 & & \\ & \ddots & \\ & & l\_{cp}^2 \end{bmatrix}^{-1} (\mathbf{x} - \mathbf{x}')}\right) \tag{20}$$

$$k(\mathbf{x}, \mathbf{x}') = k\_1(\mathbf{x}, \mathbf{x}') + k\_2(\mathbf{x}, \mathbf{x}') \tag{21}$$
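Under illustrative assumptions (1-D inputs, so the length-scale matrices reduce to scalars, and arbitrary hyperparameter values), the composite kernel of Equations (19)–(21) can be sketched as the sum of a squared exponential term and an exponential term; the sum remains symmetric and positive definite, i.e. a valid kernel.

```python
import numpy as np

def k1(x1, x2, sigma_f1=1.0, ls=1.0):
    """Squared exponential kernel of Eq. (19), 1-D case: smooth trend."""
    d = x1[:, None] - x2[None, :]
    return sigma_f1**2 * np.exp(-0.5 * (d / ls) ** 2)

def k2(x1, x2, sigma_f2=1.0, lc=1.0):
    """Exponential kernel of Eq. (20), 1-D case: for scalar inputs the
    square root of the quadratic form reduces to |x - x'| / l_c."""
    d = np.abs(x1[:, None] - x2[None, :])
    return sigma_f2**2 * np.exp(-d / lc)

def k(x1, x2):
    """Composite kernel of Eq. (21): a sum of kernels is itself a kernel."""
    return k1(x1, x2) + k2(x1, x2)

X = np.linspace(0.0, 1.0, 5)
K = k(X, X)
# diagonal equals sigma_f1^2 + sigma_f2^2; K is symmetric positive definite
```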
