
Gaussian Processes and Polynomial Chaos Expansion for Regression Problem: Linkage via the RKHS and Comparison via the KL Divergence

College of Liberal Arts and Sciences, National University of Defense Technology, Changsha 410073, China
*
Author to whom correspondence should be addressed.
Entropy 2018, 20(3), 191; https://doi.org/10.3390/e20030191
Submission received: 21 January 2018 / Revised: 6 March 2018 / Accepted: 12 March 2018 / Published: 12 March 2018
(This article belongs to the Section Information Theory, Probability and Statistics)

Abstract
In this paper, we examine two widely-used approaches for the development of surrogate models: the polynomial chaos expansion (PCE) and Gaussian process (GP) regression. The theoretical differences between the PCE and GP approximations are discussed. A state-of-the-art PCE approach is constructed based on high-precision quadrature points; however, the need for truncation may result in a loss of precision. The GP approach performs well on small datasets and allows a fine and precise trade-off between fitting the data and smoothing, but its overall performance depends largely on the training dataset. The reproducing kernel Hilbert space (RKHS) and Mercer's theorem are introduced to form a linkage between the two methods: we prove that the two surrogates can be embedded in two isomorphic RKHS, and on this basis we propose a novel method named Gaussian process on polynomial chaos basis (GPCB) that incorporates the PCE and the GP. A theoretical comparison is made between the PCE and the GPCB with the help of the Kullback–Leibler divergence, and we show that the GPCB is as stable and accurate as the PCE method. Furthermore, the GPCB is a one-step Bayesian method that chooses the best subset of the RKHS in which the true function should lie, while the PCE method requires an adaptive procedure. Simulations of 1D and 2D benchmark functions show that the GPCB outperforms both the PCE and classical GP methods. In order to solve high dimensional problems, a random sample scheme with a constructive design (i.e., a tensor product of quadrature points) is proposed to generate a valid training dataset for the GPCB method. This approach exploits the high numerical accuracy underlying the quadrature points while ensuring computational feasibility. Finally, the experimental results show that our sample strategy has higher accuracy than classical experimental designs and is suitable for solving high dimensional problems.

1. Introduction

Computer simulations are widely used in learning tasks, where a single simulation is an instance of the system [1,2]. A simple approach to the learning task is to randomly sample input variables and run the simulation for each input to obtain the features of the system. Similar approaches are utilized in Monte Carlo techniques [3]. However, even a single simulation can be computationally costly due to its high complexity, so obtaining a trustworthy result via sufficiently many simulations becomes intractable. Mathematical methods and statistical theorems are introduced to generate surrogate models to replace the simulations, especially when dealing with complex systems with many parameters [4,5]. Although the main drawback of surrogate models is that only approximations can be obtained, they are computationally efficient whilst maintaining the essential information of the system, and thus allow its properties to be analyzed. Attempting to construct surrogate models with an acceptable number of simulations necessitates the development of robust techniques to determine their reliability and validity [6,7,8]. Many researchers are working on improved sampling strategies that decrease the number of simulations required, which makes this task all the more significant [9]. With an increasing number of surrogate models being developed, a comprehensive understanding of the uncertainties introduced by those models is needed. The main purpose of uncertainty quantification (UQ) is to establish a relationship between input and output, i.e., the propagation of input uncertainties, and then to quantify the difference between surrogate models and the original simulations. UQ can provide a measure of the surrogate model's accuracy and, at the same time, an indication of how to update the model [10,11,12].
Denote by $f$ a function (or simulator) of the original system; then, given the experimental design $X$, the output $Y = f(X)$ is produced, where capital notation is used because the input and output are usually vectors (or matrices) in simulations. From a statistical perspective, the input uncertainties are introduced by their randomness, so we represent the input with a random variable $\mathbf{x}$ whose prior probability density function (PDF) is $p(\mathbf{x})$, such as the multivariate Gaussian distribution. As for the output uncertainties, a common technique is to integrate the system uncertainty and the approximation error into a noise term $\epsilon$. In fact, the output $y$ is also a random variable $y = f(\mathbf{x}) + \epsilon$ determined by $f$, $\mathbf{x}$ and $\epsilon$. Now, suppose a surrogate $\bar{f}(\mathbf{x})$ is constructed to approximate $f(\mathbf{x})$; then UQ is used to identify the distribution and statistical features (for example, the Kullback–Leibler divergence) of $y$, which are essential to the validation and verification of surrogates. Basically, two preconditions need to be satisfied: firstly, the surrogate models are well defined, i.e., any $\bar{f}$ is a measurable function with respect to (w.r.t.) the corresponding probability space of $p(\mathbf{x})$; secondly, techniques are needed that learn from the prior information to obtain the best guess of the true function.
There are a number of studies in the literature proposing different surrogates for specific applications, such as multivariate adaptive regression splines (MARS) [13], support vector regression (SVR) [14], artificial neural networks (ANN) [15] for reliability and sensitivity analyses, and kriging [16] for structural reliability analysis. We mainly focus on two popular methods that have been studied extensively in recent years. The first is the polynomial chaos expansion (PCE), also known as a spectral approach [17]. The PCE aims to represent an arbitrary random variable of interest as a spectral expansion of other random variables with prior PDF. Xiu et al. [18,19,20] have generalized the PCE in terms of the Askey scheme of polynomials, so the surrogates can be expressed by a series of orthogonal polynomials w.r.t. the distributions of the input variables. These polynomials can be extended as a basis of a polynomial space. In general, methods used to solve PCE problems are categorized into two types: intrusive and non-intrusive. The main idea behind the intrusive methods is the substitution of the input $\mathbf{x}$ and output $f(\mathbf{x})$ with the truncated PCE and the calculation of the coefficients with the help of Galerkin projection [21]. However, the explicit form of $f$ is required to compose the Galerkin system, and a specific algorithm or program is needed to solve each particular problem. It is for these reasons that the intrusive models are not widely used; non-intrusive methods have been developed to avoid these limitations [21,22]. There are two main aspects of the non-intrusive methods: one is the choice of sampling strategy, for example Monte Carlo techniques; the other is the computational approach. These two aspects are not independent of each other: for example, if $x \sim N(0,1)$, then the Gaussian quadrature method is introduced to solve the numerical integration, and $X$ is the set of corresponding quadrature points. The second common method for constructing surrogate models is the Gaussian process (GP), which is actually a Bayesian approach. Instead of attempting to identify a specific real model of the system, the GP method provides a posterior distribution over the model in order to make robust predictions about the system. As described in the highly influential works [23,24,25,26], the GP can be treated as a distribution over functions with properties controlled by a kernel. For the two prerequisites discussed in the previous paragraph, the GP generates a surrogate model that lies in a space spanned by kernels; meanwhile, Bayesian linear regression or classification methods are introduced to utilize the prior information.
Both the PCE and GP methods build surrogates, but there are some differences between them. The PCE method builds a surrogate of a random variable $y$ as a function of another prior random variable $\mathbf{x}$ rather than of the distribution density function itself. The PCE surrogates are based on the orthogonal polynomial basis corresponding to $p(\mathbf{x})$, so it is simple to obtain the mean and standard deviation of $y$. In contrast, the GP utilizes the covariance information, so it performs better in capturing the local features. Although both the PCE and GP approaches are feasible methods to compute the mean and standard deviation of $y$, the PCE performs more efficiently than the GP method.
As mentioned above, both the PCE and GP methods have their own trade-offs to consider when building surrogates, and there exists a connection to be explored. According to Paul Constantine's work [27], ordinary kriging (i.e., GP in geostatistics) interpolation can be viewed as a transformed version of the least squares (LS) problem, and the PCE can be viewed as least squares with selected basis and weights. However, the GP reverts to interpolation when the noise term is zero. When taking the noise term into consideration, the Gaussian process with the kernel (i.e., covariance matrix) $X^T X$ can be viewed as a ridge regression problem [28] with a regularization term. Furthermore, different numerical methods can affect the precision of the PCE method as well. For example, Xiu [20] analyzed the aliasing error w.r.t. the projection method and the interpolation method. Thus, the inherent connection of the two models cannot be simply summarized as an LS solution, and how to output a model with high precision remains an interesting question.
Connections between the PCE and GP methods have been explored by R. Schöbi et al. They introduced a new meta-modeling method named PC-kriging [29] (polynomial-chaos-based kriging) to solve problems such as rare event estimation [30], structural reliability analysis [31] and quantile estimation [32]. In their papers, the PCE models can be viewed as a special form of GP where a Dirac function is introduced as the kernel. They also proposed the idea that the PCE models perform better in capturing the global features, whereas the GP models approximate the local characteristics. We would describe the PC-kriging method as a GP model with a PCE-form trend function along with a noise term. The global features are dominated by the PCE trend, and the local structures (residuals) are approximated by the ordinary GP process. The PC-kriging model thus introduces the coefficients as parameters to be optimized, and the solution can be derived by Bayesian linear regression with the basis consisting of the PCE polynomials. They also use the LARS algorithm to calibrate the model and to select a sparse design. They construct a rigorous framework to optimize the parameters, validate and calibrate the model and evaluate the model accuracy.
Unlike PC-kriging, which takes the PCE as a trend, this paper focuses on the construction of the kernel in the GP to solve regression problems, through which we can combine the two methods into a unified framework, taking the positive aspects from both and thereby refining the surrogates. In other words, we wish to find the connection between the GP and the PCE by analyzing the nature of their solutions, and we propose a new approach to achieve high-precision predictions. The main idea of this paper is described as follows. Firstly, the PCE surrogate is embedded in a Hilbert space whose basis functions are the orthonormal polynomials themselves; then a suitable inner product and a Mercer kernel [33] are defined to build a reproducing kernel Hilbert space (RKHS) [33]. Secondly, the kernel of the GP can be decomposed in terms of its eigenfunctions, and we can define an inner product to generate an RKHS as well. We elaborate the two procedures explicitly and prove that the two RKHS are isometrically isomorphic. Hence, a connection between these two approaches is established via the RKHS. Furthermore, we can obtain a solution of the PCE model by solving a GP model with the Mercer kernel w.r.t. the PCE polynomial basis. We name this approach Gaussian process on polynomial chaos basis (GPCB). In order to illustrate the capability of the GPCB method, we use the Kullback–Leibler divergence [34] to explicitly compare the PDFs of the posterior predictions of the GPCB and the PCE method. Provided that the true function can be approximated by a finite number of PCE basis functions, it can be concluded that the GPCB converges to the optimal subset of the RKHS in which the true function lies.
The experimental design from the PCE model, i.e., the full tensor product of quadrature points in each dimension, is used in the GPCB. We thereby overcome two concerns about the PCE and the GP, respectively. Firstly, the PCE is based on a truncated polynomial basis, while the GPCB keeps all polynomials, which can be regarded as maintaining the information in every feature. Secondly, the GP's behavior depends on the experimental design; however, it often achieves the best results on small, local datasets. The quadrature points derived from the PCE model are distributed evenly in the input space, and those points have high numerical precision w.r.t. the polynomial basis; hence, they work well with the GPCB. However, we must admit that the GPCB is still a GP approach, so when the dimension of the input variables grows, the computational burden becomes an issue. In order to cope with high dimensional problems, sampling strategies that lower the number of design points are considered. The AK-MCS method [35] is a useful tool that adaptively selects new experimental designs; however, its experimental design tends to validate the selected surrogate model. We propose a new method that is model-free and that makes full use of the quadrature points: we randomly choose a sparse subset of the quadrature points to form a new experimental design while maintaining the accuracy. Several classical sampling strategies, such as MC, Halton and LHS, are introduced for comparison. Our sampling scheme has superior performance under the conditions considered in this paper. The GPCB is a novel method to build surrogate models, and it can be used for various physical problems such as reliability analysis and risk assessment.
This paper is divided into two parts. In Part 1, we discuss the mathematical rigor of the method: a brief summary of PCE and GP is presented in Section 2; the reproducing kernel Hilbert space (RKHS) is introduced to connect these two methods in Section 3; the GPCB method is proposed based on the discussion in Section 4; meanwhile, a theoretical Kullback–Leibler divergence between the GPCB and PCE method is demonstrated. In Part 2, an explicit Mehler kernel is presented with the Hermite polynomial basis in the last part of Section 4; several tests of the GPCB with some benchmark functions are presented in Section 5, along with the random constructive sampling method for high dimensional problems.

2. Brief Review of PCE and GP

First, we want a clear picture of how the PCE and GP work when only samples of the input $X$ and output $Y$ are available. Different assumptions are made for the PCE and GP, respectively, and the corresponding procedures are presented in the following subsections.

2.1. Polynomial Chaos Expansion

As discussed in Section 1, the output is assumed to be represented by a model $y = f(\mathbf{x}) + \epsilon$, where $f(\mathbf{x}): \Omega \to \mathbb{R}$ is the underlying true function and $\epsilon \sim N(0, \sigma_\epsilon^2)$. Here, we define $\mathbf{x} = (x_1, x_2, \ldots, x_d)^T$ as a $d$-dimensional vector of independent random variables in a bounded domain $\Omega \subset \mathbb{R}^d$. Suppose $\{x_i, i = 1, \ldots, d\}$ are independent and identically distributed; the joint PDF then has the form $p(\mathbf{x}) = \prod_{i=1}^{d} p(x_i)$. In the context of the PCE, we seek a surrogate of the model $f(\mathbf{x})$ as an expansion in a series of orthonormal polynomials $\phi_{\boldsymbol{\alpha}}(\mathbf{x})$:
$$f(\mathbf{x}) = \sum_{\boldsymbol{\alpha} \in \mathbb{N}^d} \beta_{\boldsymbol{\alpha}} \phi_{\boldsymbol{\alpha}}(\mathbf{x}), \quad \boldsymbol{\alpha} = \{\alpha_1, \ldots, \alpha_d\}$$ (1)
where $\boldsymbol{\alpha}$ is the multi-index, $\phi_{\boldsymbol{\alpha}}(\mathbf{x}) = \prod_{i=1}^{d} \phi^{(i)}_{\alpha_i}(x_i)$ and $\int_{\Omega_i} \phi^{(i)}_m(x_i)\,\phi^{(i)}_n(x_i)\,p_i(x_i)\,dx_i = \delta_{mn}$, with $\Omega_i$ the marginal domain of $\Omega$ and $\delta_{mn}$ the Kronecker delta. Xiu et al. [20] have summarized various correspondences between the distribution and the polynomial basis to form the generalized polynomial chaos.
It is proven that the original model $f(\mathbf{x})$ can be approximated to any degree of accuracy in a strong sense [20], e.g., $f(\mathbf{x}) \approx \sum_{\boldsymbol{\alpha} \in \mathbb{N}^d} \beta_{\boldsymbol{\alpha}} \phi_{\boldsymbol{\alpha}}(\mathbf{x})$ in the $L^2$ norm defined on $\Omega$, although $f$ does not necessarily lie in the span of the orthonormal polynomial basis. Since we are unable to calculate an infinite series, a truncation scheme for the multi-index $\boldsymbol{\alpha}$ is introduced such that we can rearrange the polynomials. For simplicity, we can rewrite Equation (1) in the following form:
$$f(\mathbf{x}) \approx \sum_{l=0}^{M} \beta_l \phi_l(\mathbf{x}) \triangleq f_P.$$ (2)
We can simply solve the above system via the ordinary least squares method or the non-intrusive method. Specifically, we focus on the non-intrusive projection method, whereby we can directly obtain the coefficients by taking the expectation value of Equation (2) multiplied by ϕ l ( x ) :
$$\beta_l = \int f(\mathbf{x})\,\phi_l(\mathbf{x})\,p(\mathbf{x})\,d\mathbf{x} \approx \sum_{i=1}^{N} \omega_i f(X_i)\,\phi_l(X_i), \quad l = 0, \ldots, M$$ (3)
where the second equality is obtained by numerical integration techniques, such as the Gaussian quadrature rule, and $\{X_i, i = 1, \ldots, N\}$ and $\{\omega_i, i = 1, \ldots, N\}$ are the corresponding nodes and weights. The integration is exact when $f(\mathbf{x})$ is a polynomial of sufficiently low degree. Combining Equations (2) and (3), $f_P(\mathbf{x})$ has the form:
$$f_P(\mathbf{x}) \approx \sum_{l=0}^{M} \left( \sum_{i=1}^{N} \omega_i f(X_i)\,\phi_l(X_i) \right) \phi_l(\mathbf{x}) = \sum_{l=0}^{M} \beta_l \phi_l(\mathbf{x}).$$ (4)
{ f ( X i ) , i = 1 , , N } remain unknown to us, and usually, they are substituted by { Y i , i = 1 , , N } . Note Y i = f ( X i ) + ϵ i , so such a substitution will introduce noise into the surrogate; hence, the approximation error is neglected as a source of uncertainty.
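For concreteness, the projection step of Equations (3) and (4) can be sketched in a few lines of Python/NumPy; the one-dimensional test function, the truncation degree and the use of NumPy's hermite_e module are illustrative assumptions rather than part of our implementation.

```python
import numpy as np
from math import factorial
from numpy.polynomial import hermite_e as He

# Hypothetical 1D test function; f and the degree M are illustrative only.
f = lambda x: 5.0 + x + np.exp(x)
M = 10                       # PCE truncation degree
N = M + 1                    # number of Gauss-Hermite quadrature nodes

# hermegauss integrates against exp(-x^2/2); dividing the weights by
# sqrt(2*pi) turns them into quadrature weights w.r.t. the N(0,1) density p(x).
X, w = He.hermegauss(N)
w = w / np.sqrt(2.0 * np.pi)

def phi(l, x):
    """Orthonormal probabilists' Hermite polynomial He_l(x)/sqrt(l!)."""
    c = np.zeros(l + 1)
    c[l] = 1.0
    return He.hermeval(x, c) / np.sqrt(factorial(l))

# Non-intrusive projection, Eq. (3): beta_l ~ sum_i w_i f(X_i) phi_l(X_i)
beta = np.array([np.sum(w * f(X) * phi(l, X)) for l in range(M + 1)])

# Truncated surrogate, Eq. (4)
f_P = lambda x: sum(beta[l] * phi(l, x) for l in range(M + 1))

x_test = np.linspace(-2.0, 2.0, 5)
print(np.column_stack([f(x_test), f_P(x_test)]))   # true vs. surrogate values
```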

2.2. Gaussian Process Regression

The analysis of the Gaussian process regression model [26] is reviewed in this section. A Gaussian prior is placed over the function $f(\mathbf{x})$, i.e., $f(\mathbf{x}) \sim \mathcal{GP}(m(\mathbf{x}), k(\mathbf{x}, \mathbf{x}'))$, where $m$ is the mean function and $k$ is the kernel function, which is positive semi-definite and bounded. More specifically, let $X = \{X_i, i = 1, \ldots, N\} \in \Omega^N$ be the input data and $Y = \{Y_i, i = 1, \ldots, N\} \in \mathbb{R}^N$ be the output data; then we have $Y = f(X) + \epsilon$ with $f(X) \sim N(m(X), k(X, X))$ and $\epsilon \sim N(0, \sigma_\epsilon^2 I)$. Bear in mind that the mathematical expression of $f(\mathbf{x})$ is implicit, so $f(\mathbf{x})$ is approximated by the best-guess prediction $f_G$ in the statistical sense. With the help of Bayes' theorem, the prediction and the corresponding variance at a new point $\mathbf{x}$ can be obtained by the following equations [36]:
$$p_G(f(\mathbf{x}) \mid Y, X, \mathbf{x}, \theta) = N\big(f_G(\mathbf{x}), \mathrm{cov}(f_G(\mathbf{x}))\big), \quad f_G(\mathbf{x}) \triangleq E[f(\mathbf{x}) \mid Y, X, \mathbf{x}, \theta] = K_\mathbf{x}^T [K + \sigma_\epsilon^2 I]^{-1} Y, \quad \mathrm{cov}(f_G(\mathbf{x})) = K_{\mathbf{xx}} - K_\mathbf{x}^T [K + \sigma_\epsilon^2 I]^{-1} K_\mathbf{x},$$ (5)
where $K = k(X, X) \in \mathbb{R}^{N \times N}$ is the covariance matrix with $K_{ij} = k(X_i, X_j)$, and $K_\mathbf{x} = k(X, \mathbf{x}) \in \mathbb{R}^{N \times 1}$ and $K_{\mathbf{xx}} = k(\mathbf{x}, \mathbf{x}) \in \mathbb{R}$ are defined similarly. Note that Equation (5) shows that the mean value of the posterior distribution can be expressed as a linear combination of $N$ kernel functions as follows:
$$f_G(\mathbf{x}) = \sum_{i=1}^{N} \alpha_i k(X_i, \mathbf{x}), \quad \boldsymbol{\alpha} = (K + \sigma_\epsilon^2 I)^{-1} Y.$$ (6)
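As an illustration of Equations (5) and (6), the following self-contained Python sketch computes the posterior mean and variance for a zero-mean GP prior; the squared-exponential (RBF) kernel and the toy data are placeholder assumptions (in Section 4, this kernel is replaced by a Mercer kernel constructed from the PCE basis).

```python
import numpy as np

def rbf_kernel(A, B, ell=1.0):
    """Squared-exponential kernel; any positive semi-definite kernel works."""
    d2 = (A[:, None] - B[None, :]) ** 2
    return np.exp(-0.5 * d2 / ell**2)

def gp_posterior(X, Y, x_star, kernel, sigma_eps=0.1):
    """Posterior mean and variance of Eq. (5) for a zero-mean GP prior."""
    K = kernel(X, X)
    K_x = kernel(X, x_star)           # N x n_star cross-covariance
    K_xx = kernel(x_star, x_star)
    A = K + sigma_eps**2 * np.eye(len(X))
    alpha = np.linalg.solve(A, Y)     # (K + sigma^2 I)^{-1} Y, Eq. (6)
    mean = K_x.T @ alpha
    cov = K_xx - K_x.T @ np.linalg.solve(A, K_x)
    return mean, np.diag(cov)

# Toy usage with an assumed test function
rng = np.random.default_rng(0)
X = rng.normal(size=15)
Y = np.sin(X) + 0.1 * rng.normal(size=15)
xs = np.linspace(-3.0, 3.0, 7)
mean, var = gp_posterior(X, Y, xs, rbf_kernel)
print(mean, var)
```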

3. Links between the PCE and GP

The basic concepts of the PCE and GP are discussed in Section 2. The GP generates a surrogate based on Bayes' theorem and the Gaussian hypothesis; however, it is controlled by the kernel function and the experimental design and usually does not utilize the prior distribution information. The PCE substitutes the model with orthonormal polynomials, which is more computationally efficient, but performs badly when facing noisy or big data. This section aims to build a connection between the PCE and the GP so that they can be studied within the same structure and be combined to improve the performance of the surrogates. The reproducing kernel Hilbert space will be of great help in building such a bridge, and we present it as follows.

3.1. Generate an RKHS from a Mercer Kernel Constructed by the PCE Basis

In Section 2.1, we obtained a complete orthonormal basis $\{\phi_l\}$ of the Hilbert space $\mathcal{H} := \overline{\mathrm{span}}\{\phi_l(\mathbf{x})\}$ with inner product $\langle f(\mathbf{x}), g(\mathbf{x})\rangle = \int f(\mathbf{x})\,g(\mathbf{x})\,p(\mathbf{x})\,d\mathbf{x}$. We can see that the PCE method generates surrogates that are actually linear combinations of $\{\phi_l\}$, so there exists a unique expansion $f = \sum_l f_l \phi_l \in \mathcal{H}$. According to Mercer's theorem [33], we aim to define a kernel of the following form:
$$k(\mathbf{x}, \mathbf{x}') = \sum_l \lambda_l \phi_l(\mathbf{x})\,\phi_l(\mathbf{x}') \quad \text{s.t.} \quad k(\mathbf{x}, \mathbf{x}) < \infty \ \text{for} \ \mathbf{x} \in \Omega.$$ (7)
If we have positive weights $\lambda_l$ that satisfy $\sum_l \lambda_l \phi_l^2(\mathbf{x}) < \infty$, then for any $\mathbf{x} \in \Omega$, together with the Cauchy–Schwarz inequality, we have:
$$|f(\mathbf{x})| \leq \sqrt{\sum_l \frac{\langle f(\mathbf{x}), \phi_l(\mathbf{x})\rangle^2}{\lambda_l}} \sqrt{\sum_l \lambda_l \phi_l^2(\mathbf{x})}.$$ (8)
Hence $|f(\mathbf{x})|$ is point-wise bounded for any $\mathbf{x} \in \Omega$ because $f(\mathbf{x}) \in \mathcal{H}$. Checking the right-hand side of the above inequality, the second factor is ensured finite in advance; if the first factor is finite as well, then $f(\mathbf{x})$ lies in a subspace of $\mathcal{H}$ such that:
$$\mathcal{H}_P = \left\{ f \in \mathcal{H} \;\middle|\; \langle f, f\rangle_{\mathcal{H}_P} = \sum_l \frac{\langle f(\mathbf{x}), \phi_l(\mathbf{x})\rangle^2}{\lambda_l} < \infty, \ \sum_l \lambda_l \phi_l^2(\mathbf{x}) < \infty \right\}.$$ (9)
Proposition 1.
H P defined in Equation (9) is an RKHS with Mercer kernel defined in Equation (7).

3.2. Generate an RKHS from the Reproducing Kernel Map Construction

We aim to compose a space of functions in which all the GP surrogates are embedded. Given Equation (6) and an arbitrary experimental design X , define a space of functions as follows:
$$\mathcal{H}_G = \left\{ f(\mathbf{x}) = \sum_{i=1}^{N} f_i k(\mathbf{x}, X_i) \;\middle|\; N \in \mathbb{N},\ X \in \Omega^N,\ \mathbf{x} \in \Omega,\ f_i \in \mathbb{R},\ \sum_{i=1}^{N} \sum_{j=1}^{N} f_i f_j k(X_i, X_j) < +\infty \right\}.$$ (10)
Proposition 2.
H G is a pre-Hilbert space with the inner product < · , · > H G
Now that $\mathcal{H}_G$ is a pre-Hilbert space, and given the norm $\|f(\mathbf{x})\|_{\mathcal{H}_G} = \sqrt{\langle f(\mathbf{x}), f(\mathbf{x})\rangle_{\mathcal{H}_G}}$, we can take the closure of $\mathcal{H}_G$, derived from classical Hilbert space theory. This is an abstract space to which the norm of $\mathcal{H}_G$ extends; with a slight abuse of notation, we still denote the resulting Hilbert space by $\mathcal{H}_G$.
Proposition 3.
H G defined above is the unique RKHS of the kernel k ( · , · ) .

3.3. Reproducing Kernel Hilbert Spaces as a Linkage

H P and H G are RKHS with the Mercer kernel and GP kernel, respectively. We are going to investigate the relationship between the two RKHS, by which we can discuss the two approaches in a unified structure. Let X be a sample set and GP kernel k ( · , · ) be a real positive semi-definite kernel, then according to Mercer’s theorem, k ( X i , X j ) has an eigenfunction expansion:
$$k(X_i, X_j) = \sum_l \lambda_l \phi_l(X_i)\,\phi_l(X_j),$$ (11)
where the eigenfunctions $\{\phi_l\}$ are orthonormal, i.e., $\langle \phi_l, \phi_{l'}\rangle = \int \phi_l(\mathbf{x})\,\phi_{l'}(\mathbf{x})\,p(\mathbf{x})\,d\mathbf{x} = \delta_{ll'}$, and $\{\lambda_l, \phi_l\}$ satisfies $\sum_l \lambda_l \phi_l^2(\mathbf{x}) < \infty$. Let $f_X(\mathbf{x}) \in \mathcal{H}_G$ with experimental design $X$; then we can rewrite it according to Equations (10) and (11):
$$f_X(\mathbf{x}) = \sum_{i=1}^{N} f_i \sum_l \lambda_l \phi_l(X_i)\,\phi_l(\mathbf{x}) = \sum_l \left( \lambda_l \sum_{i=1}^{N} f_i \phi_l(X_i) \right) \phi_l(\mathbf{x}) \triangleq \sum_l c_l(X)\,\phi_l(\mathbf{x}),$$ (12)
where $c_l(X)$ is uniquely determined by $l$ and $X$, and $f_X$ has a form similar to that of a function lying in $\mathcal{H}_P$. Actually, given $f_X(\mathbf{x}), g_X(\mathbf{x}) \in \mathcal{H}_G$, we have:
$$\langle f_X(\mathbf{x}), g_X(\mathbf{x})\rangle_{\mathcal{H}_G} = \sum_{i=1}^{N} \sum_{j=1}^{N} f_i g_j \sum_l \lambda_l \phi_l(X_i)\,\lambda_l \phi_l(X_j)/\lambda_l = \sum_l \left( \lambda_l \sum_{i=1}^{N} f_i \phi_l(X_i) \right)\!\left( \lambda_l \sum_{j=1}^{N} g_j \phi_l(X_j) \right)\!/\lambda_l = \sum_l c_l^f(X)\,c_l^g(X)/\lambda_l = \sum_l \langle f_X(\mathbf{x}), \phi_l\rangle \langle g_X(\mathbf{x}), \phi_l\rangle/\lambda_l = \langle f_X(\mathbf{x}), g_X(\mathbf{x})\rangle_{\mathcal{H}_P}.$$ (13)
The above equation shows that their inner product can equally be evaluated in $\mathcal{H}_P$, so we can conclude that $f_X$ lies in $\mathcal{H}_P$; it also shows that the two inner products are equivalent. Next, we state a rigorous theorem proving that the two spaces are isometrically isomorphic.
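The identity in Equation (13) can also be checked numerically for any kernel with a known (truncated) eigen-expansion. The following illustrative Python sketch assumes eigenvalues $\lambda_l = \rho^l$ and orthonormal Hermite eigenfunctions (the Mehler-type kernel formally introduced in Section 4.2) and compares the two inner products for randomly chosen coefficients; the truncation level, $\rho$ and the design are arbitrary choices.

```python
import numpy as np
from math import factorial
from numpy.polynomial import hermite_e as He

L, rho = 20, 0.5
lam = rho ** np.arange(L + 1)                      # eigenvalues lambda_l = rho^l

def phi(l, x):
    c = np.zeros(l + 1); c[l] = 1.0
    return He.hermeval(x, c) / np.sqrt(factorial(l))

def k(a, b):
    """Truncated Mercer kernel sum_l lambda_l phi_l(a) phi_l(b)."""
    return sum(lam[l] * np.outer(phi(l, a), phi(l, b)) for l in range(L + 1))

rng = np.random.default_rng(1)
X = rng.normal(size=6)                              # experimental design
fc, gc = rng.normal(size=6), rng.normal(size=6)     # coefficients f_i, g_j

# <f_X, g_X>_{H_G}: double sum over the kernel matrix (left-hand side of Eq. (13))
ip_G = fc @ k(X, X) @ gc

# <f_X, g_X>_{H_P}: sum_l c_l^f c_l^g / lambda_l with c_l = lambda_l sum_i f_i phi_l(X_i)
cf = np.array([lam[l] * (fc @ phi(l, X)) for l in range(L + 1)])
cg = np.array([lam[l] * (gc @ phi(l, X)) for l in range(L + 1)])
ip_P = np.sum(cf * cg / lam)

print(ip_G, ip_P)   # the two values agree up to floating-point error
```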
Theorem 1.
The reproducing kernel Hilbert space H G of a given kernel k is isometrically isomorphic to the space H P .
According to the proof of Theorem 1 in Appendix A.4, it is reasonable to introduce a weighted $\ell^2$ space $\ell^2_{1/\lambda}$ because it is difficult to find a direct linear map between $\mathcal{H}_G$ (where $c_l$ varies according to $X$ and $k$) and $\mathcal{H}_P$ (where $c_l$ varies according to the distribution of $\mathbf{x}$). Figure 1 shows two flowcharts describing the different processes of generating the RKHS.
Furthermore, the GP prediction is a combination of the kernel functions, which consist of infinite eigenfunctions, while the PCE prediction is always a combination of finite polynomial bases. The Kullback–Leibler divergence (KL divergence) is a useful criterion to indicate the performance of different surrogate models. We are going to present the comparison of the GPCB and PCE methods with the help of KL divergence in the next section.

4. Gaussian Process on Polynomial Chaos Basis

$\mathcal{H}_G$ and $\mathcal{H}_P$ are isomorphic, as discussed in Section 3, so it is natural to conduct the GP with $k(\cdot,\cdot)$ taken as the Mercer kernel generated by the polynomial basis of the PCE; we call the resulting model the Gaussian process on polynomial chaos basis (GPCB). In fact, the GPCB generates a PCE-like model, but with a different philosophy. Note that the posterior distribution of the predictions given the experimental design $\{X, Y\}$ can be calculated analytically, so we are able to compute the KL divergence as well, as presented in the following.

4.1. Comparison of the PCE and GPCB with the Kullback–Leibler Divergence

The true distribution of the system is always implicit in practice. Without loss of generality, the underlying true system is assumed to be $f_{\bar{P}}(\mathbf{x}) = \sum_{l=0}^{\bar{M}} \bar{\beta}_l \phi_l(\mathbf{x})$, provided $f_{\bar{P}}(\mathbf{x}) \in C^0(\bar{\Omega})$ so that it can be approximated by the polynomials to any degree of accuracy [20].
Firstly, we presume that $\bar{M} \leq M$, i.e., $\beta$ in Equation (4) is an unbiased estimator of $\bar{\beta}$. Hence, the PCE approximation can be considered a precise approximation of the true function. We compare the performance of the GPCB and the PCE method by comparing the posterior distributions of their predictions. It is known that, given the experimental design $\{X, Y\}$ and the kernel function $k(\mathbf{x}, \mathbf{x}') = \sum_{l=0}^{\infty} \lambda_l \phi_l(\mathbf{x})\,\phi_l(\mathbf{x}')$, the distribution of the prediction of the GPCB reads:
$$p_G(f_G(\mathbf{x})) = N\big(K_\mathbf{x}^T K_Y^{-1} Y,\; K_{\mathbf{xx}} - K_\mathbf{x}^T K_Y^{-1} K_\mathbf{x}\big) \triangleq N(\mu_1, \Sigma_1),$$ (14)
where the conditions $X, Y, \mathbf{x}$ are dropped in $p_G(f_G(\mathbf{x}) \mid X, Y, \mathbf{x})$ for simplicity and $K_Y = K + \sigma_\epsilon^2 I$. Similarly, the prediction of the PCE with the projection method is $f_P(\mathbf{x}) = \boldsymbol{\phi}\Phi^T W Y$, which is derived from the estimate $\beta = \Phi^T W Y$. Here, $\boldsymbol{\phi} = \phi(\mathbf{x}) \in \mathbb{R}^{1 \times (P+1)}$, $\Phi = \phi(X) \in \mathbb{R}^{N \times (P+1)}$, and $W = \mathrm{diag}\{\omega_1, \ldots, \omega_N\}$ is a diagonal matrix. The corresponding prediction variance is $\mathrm{cov}(f_P(\mathbf{x})) = \boldsymbol{\phi}\,\mathrm{cov}(\beta)\,\boldsymbol{\phi}^T = \sigma_\epsilon^2\,\boldsymbol{\phi}\Phi^T W^2 \Phi \boldsymbol{\phi}^T$. Dropping the conditions in $p_P(f_P(\mathbf{x}) \mid X, Y, \mathbf{x})$ as well, the previous results indicate:
$$p_P(f_P(\mathbf{x})) = N\big(\boldsymbol{\phi}\Phi^T W Y,\; \sigma_\epsilon^2\,\boldsymbol{\phi}\Phi^T W^2 \Phi \boldsymbol{\phi}^T\big) \triangleq N(\mu_2, \Sigma_2).$$ (15)
We can evaluate the discrepancy between p G ( f G ( x ) ) and p P ( f P ( x ) ) , hence comparing their performance. The KL divergence can be calculated analytically:
$$D_{KL}(p_P, p_G) = \frac{1}{2}\left[-1 + \frac{\Sigma_2}{\Sigma_1} + \log\frac{\Sigma_1}{\Sigma_2} + \frac{(\mu_1 - \mu_2)^2}{\Sigma_1}\right] \triangleq \frac{1}{2}\left[-1 + b - \log b + \frac{a^2}{\Sigma_2}\,b\right],$$ (16)
where $a$ and $b$ are simplified notations for the corresponding parts in Equation (16). In fact, $b$ is the point-wise ratio between the posterior variances of the two predictions, and $a$ represents the difference between their posterior means. We discuss the properties of $D_{KL}$ below, starting from a special case and moving to the general conditions.
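Since both posteriors are univariate Gaussians at a fixed $\mathbf{x}$, Equation (16) is simply the closed-form KL divergence between two normal distributions. The following minimal Python snippet (with arbitrary illustrative numbers for $\mu_1, \Sigma_1, \mu_2, \Sigma_2$) verifies that the generic Gaussian KL formula and the $(a, b)$ parameterization of Equation (16) coincide.

```python
import numpy as np

def kl_gauss(mu_p, var_p, mu_q, var_q):
    """KL( N(mu_p, var_p) || N(mu_q, var_q) ) for univariate Gaussians."""
    return 0.5 * (-1.0 + var_p / var_q + np.log(var_q / var_p)
                  + (mu_q - mu_p) ** 2 / var_q)

# Eq. (16) with p = p_P = N(mu2, Sigma2) and q = p_G = N(mu1, Sigma1):
# b = Sigma2 / Sigma1 and a = mu1 - mu2, so
# D_KL = 0.5 * (-1 + b - log b + a**2 * b / Sigma2)
mu1, S1, mu2, S2 = 0.10, 0.8, 0.12, 1.0   # illustrative values only
b, a = S2 / S1, mu1 - mu2
print(kl_gauss(mu2, S2, mu1, S1),
      0.5 * (-1 + b - np.log(b) + a**2 * b / S2))   # identical results
```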
Let $k$ be a truncated kernel built from the $M$ PCE basis functions, i.e., $k(\mathbf{x}, \mathbf{x}') = \sum_{l=0}^{M} \lambda_l \phi_l(\mathbf{x})\,\phi_l(\mathbf{x}')$. This can be seen as assigning $\{\lambda_l, l > M\}$ the value zero, which can be achieved by optimizing the values $\lambda_l$ with a specific procedure. Then $b$ can be simplified as:
$$b = \frac{\sigma_\epsilon^2\,\boldsymbol{\phi}\Phi^T W^2 \Phi \boldsymbol{\phi}^T}{\boldsymbol{\phi}\big(\Lambda - \Lambda\Phi^T K_Y^{-1}\Phi\Lambda\big)\boldsymbol{\phi}^T} = \frac{\boldsymbol{\phi}\Phi^T W^2 \Phi \boldsymbol{\phi}^T}{\boldsymbol{\phi}\Phi^T W K_Y^{-1} K W \Phi \boldsymbol{\phi}^T} = \frac{\boldsymbol{\phi}\Phi^T W^2 \Phi \boldsymbol{\phi}^T}{\boldsymbol{\phi}\Phi^T W U (S + \sigma_\epsilon^2 I)^{-1} U^T U S U^T W \Phi \boldsymbol{\phi}^T} \in \left[\frac{s_{\max} + \sigma_\epsilon^2}{s_{\max}},\; \frac{s_{\min} + \sigma_\epsilon^2}{s_{\min}}\right],$$ (17)
where $\Lambda = \mathrm{diag}\{\lambda_l\}$ is a diagonal matrix and $K = U S U^T$ is the eigenvalue decomposition. Let $s_{\max}$ be the maximum eigenvalue and $s_{\min}$ the minimum one; the above interval holds because $K$ is a positive definite matrix. Note that $b$ is deterministic for a fixed $\mathbf{x}$. On the other hand, the distribution of $a$ is as follows:
$$a = \boldsymbol{\phi}\big(\Phi^T W Y - \Lambda\Phi^T K_Y^{-1} Y\big) = \sigma_\epsilon^2\,\boldsymbol{\phi}\Phi^T W K_Y^{-1} Y \sim N\!\left(\sigma_\epsilon^2\,\boldsymbol{\phi}\Phi^T W K_Y^{-1} \hat{Y},\; \sigma_\epsilon^6\,\boldsymbol{\phi}\Phi^T W K_Y^{-1} K_Y^{-1} W \Phi \boldsymbol{\phi}^T\right).$$ (18)
Here, $\hat{Y}$ denotes the mean value of the observations, i.e., the true response. It is necessary to state that the randomness of $a$ comes from the random noise $\epsilon$ in the observation $Y$. In fact, we have the expectation of $D_{KL}(p_P, p_G)$ as follows:
$$E_\epsilon\big[D_{KL}(p_P, p_G)\big] = \frac{1}{2}\left[-1 + b - \log b + \frac{E_\epsilon[a^2]}{\Sigma_2}\,b\right] = \frac{1}{2}\left[-1 + b - \log b + \big(\mathrm{var}(a) + (E_\epsilon a)^2\big)\,b/\Sigma_2\right] \leq \frac{1}{2}\left[-1 + b - \log b + \frac{(\sigma_\epsilon^2 + \hat{Y}^2)\,\sigma_\epsilon^2}{(s_{\min} + \sigma_\epsilon^2)^2}\,b\right].$$ (19)
Presume that the observation $Y$ is normalized, as is $\hat{Y}$; hence, $\hat{Y}^2$ can be estimated as $O(1)$. The main difference is thus governed by the kernel $k(\cdot,\cdot)$ (or $\Lambda$) and the term $\sigma_\epsilon^2$. More specifically, the GPCB achieves a smaller variance than the PCE method in a point-wise manner because $b > 1$, and the difference between the predictions of the two methods is of the order of $\sigma_\epsilon^2$. Furthermore, if $\sigma_\epsilon^2$ is sufficiently small, i.e., $\sigma_\epsilon^2 \ll s_{\min}$, we have $b \approx 1$ and thus $E_\epsilon[D_{KL}(p_P, p_G)] \approx 0$. If $\sigma_\epsilon^2 = 0$, i.e., for noise-free models, then $b = 1$ and $a = 0$, which enforces $\Sigma_1 = \Sigma_2$ and $\mu_1 = \mu_2$, respectively. This means that $p_G$ and $p_P$ are identical distributions, i.e., $D_{KL}(p_P, p_G) = 0$.
We can conclude that the expected value of $D_{KL}(p_P, p_G)$ is bounded by a constant, which mainly depends on $\sigma_\epsilon^2$ and $\Lambda$. In other words, since $\sigma_\epsilon^2$ is given a priori and $\Lambda$ is optimized, $D_{KL}(p_P, p_G)$ is constrained, which means that the GPCB is as stable as the PCE method. Moreover, if $f_P$ has reached a desired prediction precision, then $f_G$ with the kernel constructed from the same basis can reach a desirable precision as well, with a smaller variance.
Secondly, we consider $\bar{M} > M$, in which case $\beta$ of the PCE method is no longer an unbiased estimate of $\bar{\beta}$. Let $p_{\bar{P}}$ denote the PCE approximation with the $\bar{M}$ basis; in this case, $k(\mathbf{x}, \mathbf{x}') = \sum_{l=0}^{\bar{M}} \lambda_l \phi_l(\mathbf{x})\,\phi_l(\mathbf{x}')$ is achieved by tuning the value of $\Lambda$ via a certain learning method; hence, $D_{KL}(p_{\bar{P}}, p_G)$ is also bounded by a constant according to Equation (19), i.e., the GPCB can converge to the precise PCE prediction $p_{\bar{P}}$ as well. However, the KL divergence $D_{KL}(p_{\bar{P}}, p_P)$ is given as:
$$D_{KL}(p_{\bar{P}}, p_P) = \frac{1}{2}\left[-1 + \frac{\bar{\Sigma}}{\Sigma_2} - \log\frac{\bar{\Sigma}}{\Sigma_2} + \frac{(\bar{\mu} - \mu_2)^2}{\Sigma_2}\right] \triangleq \frac{1}{2}\left[-1 + \bar{b} - \log\bar{b} + \frac{\bar{a}^2}{\Sigma_2}\right], \quad \bar{b} = \frac{\boldsymbol{\phi}\Phi^T W^2 \Phi \boldsymbol{\phi}^T + \boldsymbol{\phi}_r\Phi_r^T W^2 \Phi_r \boldsymbol{\phi}_r^T + 2\,\boldsymbol{\phi}\Phi^T W^2 \Phi_r \boldsymbol{\phi}_r^T}{\boldsymbol{\phi}\Phi^T W^2 \Phi \boldsymbol{\phi}^T} \geq 1, \quad \bar{a} = \boldsymbol{\phi}_r\Phi_r^T W Y \sim N\!\left(\boldsymbol{\phi}_r\Phi_r^T W \hat{Y},\; \sigma_\epsilon^2\,\boldsymbol{\phi}_r\Phi_r^T W^2 \Phi_r \boldsymbol{\phi}_r^T\right).$$ (20)
We denote by $\boldsymbol{\phi}_r$ the basis functions that belong to the model $f_{\bar{P}}$ but not to $f_P$. It is shown that the biased PCE $f_P$ has a smaller variance, however with a bias whose mean value depends on $\boldsymbol{\phi}_r$. We notice that $\boldsymbol{\phi}_r$ represents the high-order polynomials; hence, the bias can be considerably large in general cases, and so is $D_{KL}(p_{\bar{P}}, p_P)$. We can conclude that even though the biased PCE has a smaller variance, the relatively large bias can lead to a false prediction.
In fact, if the underlying system is smooth enough to be modeled by a polynomial approximation, then we can adaptively increase the number of polynomial bases (and of experimental design points, if necessary) to reach a precise approximation. Alternatively, we can directly use the GPCB method, a one-step Bayesian approximation that converges to the hypothetical true system $f_{\bar{P}}$. Roughly speaking, the GPCB finds $\bar{M}$ automatically by tuning the parameters $\Lambda$ instead of adaptively changing the value of $P$ in the PCE method, which is more convenient computationally. The key problem is the evaluation of $\Lambda$. Specifically, we introduce the Mehler kernel [37], which is an analytic expression of the Mercer kernel constructed from Hermite polynomials, and discuss the learning procedure for $\lambda_l$.

4.2. Construction of the Kernel with Hermite Polynomials

Recall that we have $\{\phi_{\boldsymbol{\alpha}}\}$ in the PCE as an orthonormal basis; we now regard them as eigenfunctions of a kernel $k(\cdot,\cdot)$. Since $p(\mathbf{x})$ follows the standard Gaussian distribution, $\phi_l^{(i)}(x_i) = He_l(x_i)/\sqrt{l!}$, where $x_i$ is the $i$-th variable of $\mathbf{x}$ and $He_l(x_i)$ is the Hermite polynomial of degree $l$:
$$He_l(x_i) = (-1)^l e^{\frac{x_i^2}{2}} \frac{d^l}{dx_i^l} e^{-\frac{x_i^2}{2}}.$$ (21)
Here, we denote the multi-index of the $l$-th polynomial in Equation (2) as $\alpha(l) = (\alpha_1(l), \ldots, \alpha_d(l))$, with $|\alpha(l)| = \alpha_1(l) + \cdots + \alpha_d(l)$ and $\alpha(l)! = \alpha_1(l)! \cdots \alpha_d(l)!$; then, according to the previous analysis, $\phi_{\alpha(l)}(\mathbf{x}) = He_{\alpha(l)}(\mathbf{x})/\sqrt{\alpha(l)!}$. We have the Mehler kernel $Me(\mathbf{x}, \mathbf{x}')$ [37] with the orthonormal Hermite polynomials as eigenfunctions:
$$Me(\mathbf{x}, \mathbf{x}') = \exp\!\left(-\frac{D_\mathbf{x} D_\mathbf{x}^T + D_{\mathbf{x}'} D_{\mathbf{x}'}^T}{2}\right) \exp\!\left(\sum_{i=1}^{d} \rho_i x_i x_i'\right) = \sum_{l=0}^{\infty} \rho^{\alpha(l)} \phi_{\alpha(l)}(\mathbf{x})\,\phi_{\alpha(l)}(\mathbf{x}'),$$ (22)
where the eigenvalues are $\lambda_{\alpha(l)} = \boldsymbol{\rho}^{\alpha(l)} = \prod_{i=1}^{d} \rho_i^{\alpha_i(l)}$ for the parameter $\boldsymbol{\rho}$, $D_\mathbf{x}$ denotes the row gradient operator, i.e., $D_\mathbf{x} = (\partial/\partial x_1, \ldots, \partial/\partial x_d)$, and $D_{\mathbf{x}'}$ is defined similarly. Specifically, in the one-dimensional case:
$$Me(x, x') = \frac{1}{\sqrt{1 - \rho^2}} \exp\!\left(-\frac{\rho^2 (x^2 + x'^2) - 2\rho x x'}{2(1 - \rho^2)}\right) = \sum_{l=0}^{\infty} \rho^l \phi_l(x)\,\phi_l(x'),$$ (23)
where the eigenvalues are $\lambda_l = \rho^l > 0$. The truncated kernel is $Me_M(x, x') = \sum_{l=0}^{M} \rho^l \phi_l(x)\,\phi_l(x')$, and its behavior can be investigated by varying $\rho$ and $M$. Figure 2a illustrates the truncated kernel $Me_M(x, x')$, which shows that $Me_M(x, x')$ tends to converge to $Me(x, x')$ as $M$ grows. Figure 2b shows the values of $Me(x, 0.8)$ for different $\rho$, illustrating how strongly the eigenvalues $\lambda_l = \rho^l$ influence the Mehler kernel.
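The closed form of Equation (23) and its truncated eigen-expansion are easy to compare numerically. The Python sketch below (with illustrative values of $x$, $x'$, $\rho$ and the truncation levels) reproduces the convergence behavior described above; it is not part of the MATLAB implementation used for Figure 2.

```python
import numpy as np
from math import factorial
from numpy.polynomial import hermite_e as He

def mehler(x, xp, rho):
    """Closed-form 1D Mehler kernel, Eq. (23)."""
    return np.exp(-(rho**2 * (x**2 + xp**2) - 2 * rho * x * xp)
                  / (2 * (1 - rho**2))) / np.sqrt(1 - rho**2)

def mehler_truncated(x, xp, rho, M):
    """Truncated expansion sum_{l<=M} rho^l phi_l(x) phi_l(xp)."""
    total = 0.0
    for l in range(M + 1):
        c = np.zeros(l + 1); c[l] = 1.0
        phi_l = lambda t: He.hermeval(t, c) / np.sqrt(factorial(l))
        total += rho**l * phi_l(x) * phi_l(xp)
    return total

x, xp, rho = 0.3, 0.8, 0.45
for M in (5, 10, 20, 40):
    # the truncated sum approaches the closed form as M grows
    print(M, mehler_truncated(x, xp, rho, M), mehler(x, xp, rho))
```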

4.3. Learning the Hyper-Parameter $\rho$ of $Me(x, x')$

It is clear that $\lambda_l = \rho^l$ has a great impact on the kernel values, hence affecting the convergence of $f_G(\mathbf{x})$. We start with a simple example, where $f(x) = 5 + x + \exp(x)$, $x \sim N(0, 2^2)$, is the true underlying function, and the noise term is ignored in the observations. Considering the Taylor expansion of $\exp(x)$, $f(x)$ can be approximated by the PCE with a sufficiently large $M$. In fact, we can calculate the projections $\langle f(x), \phi_l(x)\rangle$ to seek the value of $M$: when $l > M$, $\langle f(x), \phi_l(x)\rangle \approx 0$. On the other hand, as discussed in Section 4.1, the GPCB can find $M$ automatically by tuning the hyper-parameter $\rho$. Let the experimental design $X$ be the zeros of $He_{10}(x)$ in Equation (21), i.e., the quadrature points corresponding to degree 10; we compare the performance of the GPCB with $\rho$ equal to $0.1$, $0.45$ and $0.7$, respectively. The results are displayed in Figure 3. Note that the projection value shown in the figure is the absolute value of the true value for better illustration. It shows that $f(x)$ can be approximated with polynomials up to degree 40. With $\rho = 0.45$ in the Mehler kernel, the GPCB almost converges to the exact $f(x)$, whereas $\rho = 0.1$ leads to a fast convergence rate and $\rho = 0.7$ results in a slow convergence rate.
Figure 3 shows that $\rho$ has a crucial impact on the performance of the GPCB method, so a tractable method to optimize $\rho$ is needed. A natural criterion is the KL divergence, which can be minimized by finding the optimal hyper-parameter $\rho$. We discussed the KL divergence of the GPCB and the PCE surrogates under the assumption that the PCE surrogate can approximate the true system to any degree of accuracy. However, the distribution of the real system is usually unknown, which makes the calculation of the KL divergence intractable. In fact, it can easily be deduced that minimizing the KL divergence is equivalent to minimizing the negative log marginal likelihood $\Delta$; twice this quantity reads:
$$2\Delta = Y^T K_Y^{-1} Y + \log\big((2\pi)^N |K_Y|\big).$$ (24)
It is important to optimize $\rho$ and $\sigma_\epsilon^2$ to obtain a suitable kernel and hence an accurate approximation. Classical gradient-based techniques can be used to search for the optimal $\rho$; however, they may perform poorly because they only find local optima. As can be seen from Equation (23), $\rho$ should take a value between zero and one, so a global search can be used to solve the optimization problem. The procedure for generating a GPCB approximation is summarized in Algorithm 1.
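Algorithm 1 is given as a figure in the published version; the following Python sketch outlines one possible realization consistent with the description above, using the one-dimensional Mehler kernel of Equation (23), a simple grid search for $\rho \in (0, 1)$ and $\sigma_\epsilon^2$, and the marginal likelihood of Equation (24). The grid values, the test function and the choice of quadrature design are illustrative assumptions.

```python
import numpy as np
from numpy.polynomial import hermite_e as He

def mehler_kernel(A, B, rho):
    """1D Mehler kernel matrix, Eq. (23), for node vectors A and B."""
    A, B = np.atleast_1d(A), np.atleast_1d(B)
    X, Xp = np.meshgrid(A, B, indexing="ij")
    return np.exp(-(rho**2 * (X**2 + Xp**2) - 2 * rho * X * Xp)
                  / (2 * (1 - rho**2))) / np.sqrt(1 - rho**2)

def neg_log_marginal(rho, sig2, X, Y):
    """2*Delta of Eq. (24): Y^T K_Y^{-1} Y + log((2*pi)^N |K_Y|)."""
    K_Y = mehler_kernel(X, X, rho) + sig2 * np.eye(len(X))
    sign, logdet = np.linalg.slogdet(K_Y)
    return Y @ np.linalg.solve(K_Y, Y) + len(X) * np.log(2 * np.pi) + logdet

def gpcb_fit_predict(X, Y, x_star, rho_grid=np.linspace(0.05, 0.95, 19),
                     sig2_grid=(1e-4, 1e-2, 1e-1)):
    """Grid-search (rho, sigma^2) by marginal likelihood, then predict."""
    rho, sig2 = min(((r, s) for r in rho_grid for s in sig2_grid),
                    key=lambda p: neg_log_marginal(p[0], p[1], X, Y))
    K_Y = mehler_kernel(X, X, rho) + sig2 * np.eye(len(X))
    K_x = mehler_kernel(X, x_star, rho)
    return K_x.T @ np.linalg.solve(K_Y, Y), rho, sig2

# Toy usage: the zeros of He_10 (degree-10 quadrature nodes) as the design
X, _ = He.hermegauss(10)
Y = 5.0 + X + np.exp(X)                  # noise-free observations
mean, rho, sig2 = gpcb_fit_predict(X, Y, np.linspace(-2.0, 2.0, 5))
print(rho, sig2, mean)
```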

5. Numerical Investigation

In this section, we investigate the GPCB method on various benchmark functions. Firstly, we revisit the example of Figure 3, now with a noise term, i.e., the observation is $y = f(x) + \epsilon = 5 + x + \exp(x) + \epsilon$, where $x \sim N(0, 2^2)$ and $\epsilon \sim N(0, 0.1^2)$. Three methods, i.e., the GP with the RBF kernel, the PCE and the GPCB, are implemented. It is necessary to note that the Monte Carlo (MC) sampling strategy is used in the normal GP approaches, and Gaussian quadrature points are used in the GPCB approach. The main reason is that the quadrature points are too sparse for the widely-used kernels to capture the local features; for example, in this case with $P = 10$ in the PCE, the maximum quadrature point is $10.376$, which is beyond $3\sigma_x$. In the first set of experiments, let $P = 10$ in the PCE, which means 11 sample points are used; furthermore, 10,000 samples are used as the test dataset to compute the ECDF (empirical cumulative distribution function) and RMSE (root-mean-square error). The GP algorithms are implemented with the gpml toolbox [38] written in MATLAB with four different kernels, i.e., the linear, quadratic, Gaussian and Matérn-3/2 kernels. The comparisons of the results are displayed in Figure 4.
Figure 4a illustrates the point-wise KL divergence between the true value of $f(x)$ and the predictions on the interval $[-4, 4]$, based on Equation (16). It is clear that the distribution of the GPCB prediction is statistically closest to the true response, although the GP methods with the quadratic, Gaussian and Matérn-3/2 kernels outperform the GPCB at some points. Figure 4b compares the ECDF of $y$ based on the test dataset. It shows that both the PCE and the GPCB have an ECDF similar to the true one; upon closer inspection of the magnified subregion, it is obvious that the ECDF of the GPCB is almost exactly the same as the real ECDF, which shows that the GPCB has captured the features of $f(x)$ with high precision. At the same time, the RMSEs of the PCE, linear kernel, quadratic kernel, Gaussian kernel, Matérn-3/2 kernel and GPCB are 1.0570, 37.3526, 23.6383, 25.6044, 27.4489 and 0.4108, respectively. We have implemented a second set of experiments that varies the degree $P$ (i.e., the number of experimental design points), illustrated in Figure 4c. Figure 4c shows that the GPCB generally outperforms the ordinary GP with the RBF kernel, which indicates that the GPCB performs better with few (or sparse) training points. It is notable that the PCE and the GPCB perform with almost the same precision when the degree is greater than 16. This echoes the statistical equivalence of the PCE and the GPCB presented in Section 4.1.
Similar experiments are conducted with a two-dimensional function, $f(\mathbf{x}) = \exp(x_1)/\exp(x_2)$. Let $x_1, x_2 \sim N(0, 1)$ and let $y = f(\mathbf{x}) + \epsilon$ be the real model, where $\epsilon$ is an independent noise term with normal distribution $N(0, 0.1^2 I_2)$. Unlike the first test function, this test example is a limit state function. Let the maximum degree in each dimension be $p_t = 7$ for the PCE method, which makes 64 training points in total. Another dataset of 10,000 independent samples is introduced as the test set to calculate the ECDF and RMSE as well. Similarly, we show the point-wise KL divergence in the region $[-2, 2] \times [-2, 2]$ in Figure 5a. It is clear that the GPCB is globally closer to the true distribution than the other methods. The GP with the quadratic, Gaussian and Matérn-3/2 kernels can approximate the center part well, while the PCE does not seem to perform as well. Figure 5b shows that all methods except the GP with the linear kernel are able to reconstruct the distribution of the prediction, and upon closer observation we find that the ECDFs of the GPCB and the Matérn-3/2 kernel are the best approximations among the six methods. We also consider another set of experiments focusing on the number of experimental design points, which equals $(p_t + 1)^2$ for the 2D function. The RMSEs of the three method families with respect to different $p_t$ are displayed in Figure 5c. It also shows that the PCE and the GPCB generally outperform the normal GP approaches, and the GPCB has the best performance.
To summarize, the GPCB generates a surrogate as an infinite series, while the PCE can only generate a surrogate with up to $P + 1$ polynomials, and the two tend to behave with similar precision when $P$ is large enough. The set of sparse quadrature points used in the PCE, derived from the Gaussian quadrature rule, is a good design for the GPCB; the GPCB with these training points generally performs better than the normal GP methods and the PCE. However, the size of such a training set grows dramatically with the dimension ($N = (p_t + 1)^d$ in total), so it is not practical in real-life applications. In the next section, we present a strategy of sampling from these quadrature points, called candidate points, and analyze the performance of our algorithm on the selected points.

5.1. A Random Constructive Design in High Dimensional Problems

As the dimension of a system grows, so does the number of design points of the PCE, due to the tensor product of quadrature points in each dimension. The PCE may still handle thousands of training points within acceptable computational time; however, this becomes expensive for GP approaches, including our GPCB approach. Monte Carlo sampling techniques can substitute for the quadrature design, but they are not always stable. Other sampling strategies, such as Halton sampling and Latin hypercube sampling (LHS) [39], are also widely used.
In this work, we want to exploit the high accuracy of the quadrature points while reducing their massive number. Let $\mathbf{x} \in \mathbb{R}^d$ and let $p_t$ be the maximum degree in each dimension, so that $p_t + 1$ quadrature points are needed per dimension, which makes the total number of tensor-product quadrature points $\#\{X_c\} = (p_t + 1)^d$. We seek a subset of the candidate design $X_c$ that has a good coverage rate in the space; we therefore propose the random constructive design. Note that the LHS design can be extended to the larger interval $(1, (p_t + 1)^d)$ and can produce points at midpoints (endpoints), so we use the LHS design to sample $N$ indices from this interval. More specifically, we presume that the points in $X_c$ are equally important, so we arrange those points in a certain order to obtain their indices; we then sample from the indices with the LHS design, and each index is mapped to a certain quadrature point. The corresponding $N$ points are what we need. This can easily be implemented with the MATLAB built-in function lhsdesign; a sketch of the procedure is given below.
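The following Python sketch mirrors the procedure just described; it substitutes SciPy's LatinHypercube sampler for the MATLAB function lhsdesign, and the dimension, degree and sample size are illustrative.

```python
import numpy as np
from itertools import product
from numpy.polynomial import hermite_e as He
from scipy.stats import qmc

def random_constructive_design(d, p_t, N, seed=0):
    """Pick N points from the tensor grid of (p_t+1)^d Gauss-Hermite nodes
    by Latin-hypercube sampling over the point indices (cf. Section 5.1)."""
    nodes, _ = He.hermegauss(p_t + 1)                        # 1D quadrature nodes
    candidates = np.array(list(product(nodes, repeat=d)))    # (p_t+1)^d x d grid
    n_cand = len(candidates)
    # One-dimensional LHS in [0, 1); scale to indices 0 .. n_cand-1
    lhs = qmc.LatinHypercube(d=1, seed=seed).random(N).ravel()
    idx = np.unique((lhs * n_cand).astype(int))
    return candidates[idx]

X = random_constructive_design(d=3, p_t=6, N=50)
print(X.shape)          # roughly (50, 3); duplicate indices, if any, are removed
```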
Take a three-dimensional input space as an example, where $x_i \sim N(0, 1)$, $i = 1, 2, 3$. Set $p_t = 6$; then $\#\{X_c\} = 343$. The candidate design and a subset $X$ of 50 points are illustrated in Figure 6. We can see from the figure that our sample is sparse within the whole set of candidate points, and it behaves uniformly in dimension one, as illustrated in Figure 6b, with similar conclusions in the other two dimensions. When projecting the sample onto dimensions one and two to obtain Figure 6c, we can see that the selected points cover almost every point of $X_c$, which means the design retains all features (i.e., quadrature point values) of these two dimensions. This shows that the method generates a sparse subset while guaranteeing the coverage rate of the whole candidate design. We name it the random constructive design.
Now, we want to find out whether these samples retain the accuracy of the full design. Firstly, we use the PCE method to test those samples on the benchmark Ishigami function [40]: $f(\mathbf{x}) = \sin(x_1) + 7\sin^2(x_2) + 0.1 x_3^4 \sin(x_1)$, where four different sampling strategies are compared. Set $p_t = 15$, so the candidate design $X_c$ has a size of 4096. The noise term $\epsilon$ is removed in this simulation for the accuracy test. The RMSE is computed on 10,000 independently sampled data points, and the results are presented in Figure 7. It can be seen that our sample always performs better than the other samples. When the number of samples exceeds 900, the RMSE reaches $1.0605 \times 10^{-5}$, which equals the RMSE obtained with the whole candidate set. Therefore, we can select only 20% of $X_c$ and obtain the same precision. Furthermore, if we set our target precision to $10^{-2}$, only 400 points are needed. This shows that the quadrature points have high precision in numerical calculation; in other words, the points in the candidate set are good points.
Then, the random constructive design is used with the PCE, GP and GPCB methods for the Ishigami function, with the noise term added to the observations. We take the RMSE as a criterion to compare their performance, and the results are illustrated in Figure 8. Figure 8a shows that the GPCB is always better than the PCE method, and the two tend to behave the same. However, as the number of sampling points grows, the GP with the quadratic, Gaussian and Matérn-3/2 kernels generally outperforms the other methods. The Ishigami function is a bounded function; therefore, the samples are likely to fill the whole observation space as their number increases, hence improving the accuracy of the GP method. We plot the ECDFs of the three method families for $N = 1000$ in Figure 8b. It is clear that the GP with the quadratic, Gaussian and Matérn-3/2 kernels can almost recover the true distribution of the response, which is beyond the capability of the PCE and GPCB.
A six-dimensional problem is tested with the G-function [41], which, unlike the Ishigami function, is unbounded on the domain $(-\infty, \infty)^6$:
$$f(\mathbf{x}) = \prod_{i=1}^{6} \frac{|4x_i - 2| + a_i}{1 + a_i}, \quad \mathrm{where}\ a_i = \frac{i - 2}{2}, \ i = 1, \ldots, 6.$$ (25)
The experiment is performed with the same approaches, and the results are shown in Figure 9. Figure 9a shows that the GPCB outperforms the PCE and the GP with the Gaussian kernel, and it has similar precision to the quadratic and Matérn-3/2 kernels. We notice that the GPCB is more stable than the PCE method, which behaves badly especially when $N = 150, 300$. Figure 9b shows that none of these methods can reconstruct the probability distribution of $y$ very well; however, the GPCB is still comparatively the closest.
Finally, we present a more complicated model, a 15-dimensional function of the following form:
$$f(\mathbf{x}) = \mathbf{a}_1^T \mathbf{x} + \mathbf{a}_2^T \sin(\mathbf{x}) + \mathbf{a}_3^T \cos(\mathbf{x}) + \mathbf{x}^T M \mathbf{x}.$$ (26)
The distribution of $\mathbf{x}$ is the product of 15 independent distributions, i.e., $x_i \sim N(0, 1)$, $i = 1, \ldots, 15$. This function is introduced in the work of O'Hagan, where $\mathbf{a}_1, \mathbf{a}_2, \mathbf{a}_3, M$ are defined [42]. We can see that this function is dominated by the linear and quadratic terms, so it may be well approximated by a low-order PCE model. Let $p_t = 3$ in the PCE model; we can see from Figure 10a that the GP with the quadratic kernel performs best among the six methods, while the PCE performs better than the GPCB and the other GP methods. On the other hand, the GPCB is generally better than the GP methods except for the quadratic kernel on this function. When $N = 1000$, the PCE can generate $y$ following the real distribution, according to the ECDF in Figure 10b.

6. Conclusions

This paper has examined two different surrogates of computational models, i.e., the polynomial chaos expansion and Gaussian process regression. First, we presented a brief review of these two approaches. Next, we discussed the relationship between the PCE and the GP and found that the PCE and GP surrogates are embedded in two isomorphic RKHS. Mercer's theorem was introduced to generate a kernel based on a PCE basis, by which a new approach, named the GPCB, was proposed. An example shows that, with the same experimental design, the GPCB tends to retain useful information in a suitable subspace of the RKHS by tuning its hyper-parameters, whereas the PCE simply sets the residual information to zero. We further investigated the approximation performance on two test functions in 1D and 2D, respectively, and illustrated their approximation properties. In order to deal with the high dimensional scenario, a random constructive design drawn from the quadrature points was used to generate the experimental design. The results give us several directions for choosing models: basically, the GPCB outperforms the PCE, but when the original model can be well approximated by a low-order PCE (Figure 10), it seems cumbersome to introduce the GPCB and GP; when the response function is bounded (Figure 8) and sufficient training resources are available, the GP can be a better choice; when the objective function is unbounded (Figure 4 and Figure 9) or cannot be approximated by finitely many polynomials (Figure 5), we should probably choose the GPCB.
Future work can extend the family of Mercer kernels or equivalent kernels (other than the Mehler kernel presented in this paper) beyond classical approximation methods. The experimental design for GP regression can also be analyzed in many ways. Although our sampling method performs fairly well in the experiments, there remains an opportunity to discover further suitable experimental design schemes for different computational purposes, which would be of great interest. The stability of our method will be investigated in future work, i.e., how many points are needed to train a good surrogate and whether our method always produces a suitable design. Furthermore, closer connections between numerical analysis and statistics can be established via such combinations.

Acknowledgments

This work is supported by the program for New Century Excellent Talents in University, State Education Ministry in China (No. NCET 10-0893) and the National Science Foundation of China (No. 11771450, No. 61573367).

Author Contributions

Liang Yan proposed the original idea, implemented the experiments in the work and wrote the paper. Xiaojun Duan contributed to the theoretical analysis and simulation designs. Bowen Liu partially undertook the writing and simulation work. All authors read and approved the manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Proofs of Proposition 1, Proposition 2, Proposition 3 and Theorem 1

Here, we use the same notations as in Section 3.

Appendix A.1. Proof of Proposition 1

Proof. 
Define the inner product in the above subspace $\mathcal{H}_P$ as follows:
$$\langle f(\mathbf{x}), g(\mathbf{x})\rangle_{\mathcal{H}_P} = \sum_l f_l g_l / \lambda_l.$$ (A1)
Firstly, it is obvious that $\mathcal{H}_P$ is a Hilbert subspace of $\mathcal{H}$. Secondly, for any $\mathbf{x} \in \Omega$, $k(\mathbf{x}, \cdot)$ belongs to $\mathcal{H}_P$ because:
$$\langle k(\mathbf{x}, \cdot), k(\mathbf{x}, \cdot)\rangle_{\mathcal{H}_P} = \sum_l \frac{\langle k(\mathbf{x}, \cdot), \phi_l(\cdot)\rangle^2}{\lambda_l} = \sum_l \lambda_l \phi_l^2(\mathbf{x}) < \infty.$$ (A2)
It also has the reproducing property:
$$\langle f(\cdot), k(\mathbf{x}, \cdot)\rangle_{\mathcal{H}_P} = \sum_l f_l\,\lambda_l \phi_l(\mathbf{x})/\lambda_l = f(\mathbf{x}) \quad \mathrm{for}\ \mathbf{x} \in \Omega.$$ (A3)
We have the conclusion that H P is an RKHS derived from the Mercer kernel. □

Appendix A.2. Proof of Proposition 2

Proof. 
In fact, $\mathcal{H}_G$ is the space of all finite linear combinations of functions $k(\mathbf{x}, \cdot): \Omega \to \mathbb{R}$, so we can write $\mathcal{H}_G := \mathrm{span}\{k(\mathbf{x}, \cdot) \mid \mathbf{x} \in \Omega\}$, and the elements in $\mathcal{H}_G$ have the general form $f(\mathbf{x}) = \sum_{i=1}^{N} f_i k(\mathbf{x}, X_i)$. Different $N$ and all experimental designs $X$ are allowed, so that $f^{(1)}(\mathbf{x}) = \sum_{i=1}^{N_1} f_i^{(1)} k(\mathbf{x}, X_i^{(1)})$ and $f^{(2)}(\mathbf{x}) = \sum_{i=1}^{N_2} f_i^{(2)} k(\mathbf{x}, X_i^{(2)})$ both belong to $\mathcal{H}_G$. The linearity of $\mathcal{H}_G$ is shown as follows.
Let $t_1, t_2 \in \mathbb{R}$ be scalars; then we can rewrite $t_1 f^{(1)}(\mathbf{x}) + t_2 f^{(2)}(\mathbf{x})$ as a function $f(\mathbf{x})$ such that:
$$f(\mathbf{x}) = \sum_{i=1}^{N_1} t_1 f_i^{(1)} k(\mathbf{x}, X_i^{(1)}) + \sum_{j=1}^{N_2} t_2 f_j^{(2)} k(\mathbf{x}, X_j^{(2)}) \triangleq \sum_{l=1}^{N} f_l k(\mathbf{x}, X_l),$$ (A4)
where $N = N_1 + N_2$, $\{f_l, l = 1, \ldots, N\} = \{t_1 f_i^{(1)}, i = 1, \ldots, N_1\} \cup \{t_2 f_j^{(2)}, j = 1, \ldots, N_2\}$ and $X = X^{(1)} \cup X^{(2)}$. Additionally, we have:
$$\sum_{l=1}^{N} \sum_{m=1}^{N} f_l f_m k(X_l, X_m) = \sum_{i=1}^{N_1} \sum_{i'=1}^{N_1} t_1^2 f_i^{(1)} f_{i'}^{(1)} k(X_i^{(1)}, X_{i'}^{(1)}) + \sum_{j=1}^{N_2} \sum_{j'=1}^{N_2} t_2^2 f_j^{(2)} f_{j'}^{(2)} k(X_j^{(2)}, X_{j'}^{(2)}) + 2 \sum_{i=1}^{N_1} \sum_{j=1}^{N_2} t_1 t_2 f_i^{(1)} f_j^{(2)} k(X_i^{(1)}, X_j^{(2)}) < +\infty,$$ (A5)
which means that $t_1 f^{(1)}(\mathbf{x}) + t_2 f^{(2)}(\mathbf{x})$ also belongs to $\mathcal{H}_G$. Next, we show that $\mathcal{H}_G$ is an inner product space. Since the kernel function is positive semi-definite, we define the inner product of $\mathcal{H}_G$ as follows:
$$\langle f(\mathbf{x}), g(\mathbf{x})\rangle_{\mathcal{H}_G} = \sum_{i=1}^{N} \sum_{j=1}^{N} f_i g_j k(X_i, X_j).$$ (A6)
< · , · > H G is a well-defined inner product by checking the following conditions:
1.
Symmetry: $\langle f, g\rangle_{\mathcal{H}_G} = \sum_{i,j} f_i g_j k(X_i, X_j) = \sum_{j,i} g_j f_i k(X_j, X_i) = \langle g, f\rangle_{\mathcal{H}_G}$;
2.
Bi-linearity:
$$\langle t_1 f^{(1)} + t_2 f^{(2)}, g\rangle_{\mathcal{H}_G} = \sum_{l=1}^{N} \sum_{j=1}^{N} f_l g_j k(X_l, X_j) = t_1 \sum_{i=1}^{N_1} \sum_{j=1}^{N} f_i^{(1)} g_j k(X_i^{(1)}, X_j) + t_2 \sum_{i=1}^{N_2} \sum_{j=1}^{N} f_i^{(2)} g_j k(X_i^{(2)}, X_j) = t_1 \langle f^{(1)}, g\rangle_{\mathcal{H}_G} + t_2 \langle f^{(2)}, g\rangle_{\mathcal{H}_G};$$ (A7)
3.
Positive-definiteness: It is obvious that $\langle f, f\rangle_{\mathcal{H}_G} = \mathbf{f}^T K \mathbf{f} \geq 0$, with equality iff $f = 0$.
 □

Appendix A.3. Proofs of Proposition 3

Proof. 
Firstly, we can prove that the reproducing formula holds for the space H G . For any X, k ( X , x ) is a function of x and belongs to H G . Furthermore, we have:
$$\langle f(\cdot), k(\mathbf{x}, \cdot)\rangle_{\mathcal{H}_G} = \sum_{i=1}^{N} f_i k(\mathbf{x}, X_i) = f(\mathbf{x}).$$ (A8)
The above reproducing property is valid for any $f \in \mathcal{H}_G$; thus, it remains valid for the closure in the sense of the generalized identity $\langle f(\cdot), k(\mathbf{x}, \cdot)\rangle_{\mathcal{H}_G} = f(\mathbf{x})$. Then, we need to prove that $\mathcal{H}_G$ is unique. Suppose that we have another Hilbert space $\mathcal{H}'_G$ that is possibly an RKHS of the kernel $k(\cdot, \cdot)$; then, for specific points, we get:
$$\langle k(\mathbf{x}, \cdot), k(\mathbf{x}', \cdot)\rangle_{\mathcal{H}_G} = k(\mathbf{x}, \mathbf{x}') = \langle k(\mathbf{x}, \cdot), k(\mathbf{x}', \cdot)\rangle_{\mathcal{H}'_G}.$$ (A9)
This proves that the two inner products are the same on the span of the kernel functions; then, $\mathcal{H}'_G$ must contain $\mathcal{H}_G$ because the latter is the closure of that span. $\mathcal{H}'_G$ must be equivalent to $\mathcal{H}_G$, otherwise we could find a nonzero element $f \in \mathcal{H}'_G \setminus \mathcal{H}_G$ that is orthogonal to $\mathcal{H}_G$. However, by the reproducing property, $f(X) = \langle f, k(\cdot, X)\rangle_{\mathcal{H}'_G} = 0$ for every $X$, so $f \equiv 0$, which is a contradiction. □

Appendix A.4. Proofs of Theorem 1

Proof. 
Define a weighted $\ell^2$ space such that:
$$\ell_\lambda^2 = \left\{ h \;\middle|\; \langle h, h\rangle_{\ell_\lambda^2} = \sum_l \lambda_l h_l^2 < \infty \right\}.$$ (A10)
It is clear that $\{c_l(X)\} \in \ell_{1/\lambda}^2$ for $c_l(X)$ defined in Equation (12). Let $\ell_{1/\lambda}^2$ be the completion of the span of all $\{c_l(X)\}$; then $\mathcal{H}_G$ is isometrically isomorphic to $\ell_{1/\lambda}^2$ because, firstly, $c_l(X)$ is uniquely determined by $k(\cdot, \cdot)$ and $X$ and, secondly, according to Equation (13), $\langle f_X(\mathbf{x}), g_X(\mathbf{x})\rangle_{\mathcal{H}_G} = \sum_l c_l^f(X)\,c_l^g(X)/\lambda_l$. On the other hand, there exists a linear map such that:
$$T: \ell_{1/\lambda}^2 \to \mathcal{H}_P, \quad T(c) = \sum_l c_l \phi_l.$$ (A11)
This is a surjective map, since for every $f \in \mathcal{H}_P$ we have $\{f_l\} \in \ell_{1/\lambda}^2$. Now, we need to prove that the map $T$ is injective. Assume there exist $c$ and $c'$ such that $T(c) = T(c')$; then:
$$0 = \|T(c) - T(c')\|_{\mathcal{H}_P}^2 = \langle T(c - c'), T(c - c')\rangle_{\mathcal{H}_P} = \sum_l (c_l - c'_l)^2 / \lambda_l,$$ (A12)
which proves $c = c'$. Meanwhile,
$$\langle f(\mathbf{x}), g(\mathbf{x})\rangle_{\mathcal{H}_P} = \sum_l f_l g_l / \lambda_l, \quad \{f_l\}, \{g_l\} \in \ell_{1/\lambda}^2.$$ (A13)
The inner products remain equal, so it is also clear that $\mathcal{H}_P$ is isometrically isomorphic to $\ell_{1/\lambda}^2$.
To summarize, $\mathcal{H}_G$ and $\mathcal{H}_P$ are isomorphic, with the Hilbert space $\ell_{1/\lambda}^2$ establishing the connection between them; equivalently, the PCE and the GP with the Mercer kernel generate surrogates in the same Hilbert space.
 □
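As an informal numerical illustration of this isometry, the sketch below builds a truncated Mercer kernel from normalized probabilists' Hermite polynomials with eigenvalues $\lambda_l = \rho^l$ (a Mehler-type decay chosen purely for illustration; it is not claimed to be the exact kernel used in the experiments) and checks that the Gram-matrix inner product on the GP side agrees with $\sum_l f_l g_l / \lambda_l$ on the PCE side. The names `phi`, `lam`, `rho` and `kernel` are all assumptions of this sketch.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermeval
from math import factorial

# Truncated Mercer kernel built on a polynomial chaos basis:
#   k(x, y) = sum_{l=0}^{L} lam_l * phi_l(x) * phi_l(y),
# with phi_l the normalised probabilists' Hermite polynomials and
# lam_l = rho**l chosen only for illustration (Mehler-type decay).
L, rho = 15, 0.6
lam = rho ** np.arange(L + 1)

def phi(l, x):
    # normalised probabilists' Hermite polynomial He_l(x) / sqrt(l!)
    c = np.zeros(l + 1)
    c[l] = 1.0
    return hermeval(x, c) / np.sqrt(factorial(l))

def kernel(x, y):
    return sum(lam[l] * phi(l, x) * phi(l, y) for l in range(L + 1))

rng = np.random.default_rng(2)
X = rng.normal(size=4)                 # centres X_1, ..., X_N
a = rng.normal(size=4)                 # f(.) = sum_i a_i k(., X_i)
b = rng.normal(size=4)                 # g(.) = sum_j b_j k(., X_j)

# GP / RKHS side: Gram-matrix inner product <f, g>_{H_G} = a^T K b.
K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
ip_gp = a @ K @ b

# PCE side: expansion coefficients f_l = sum_i a_i * lam_l * phi_l(X_i),
# then <f, g>_{H_P} = sum_l f_l * g_l / lam_l.
fl = np.array([lam[l] * sum(a[i] * phi(l, X[i]) for i in range(4)) for l in range(L + 1)])
gl = np.array([lam[l] * sum(b[j] * phi(l, X[j]) for j in range(4)) for l in range(L + 1)])
ip_pce = np.sum(fl * gl / lam)

print(np.isclose(ip_gp, ip_pce))       # the two inner products coincide
```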

References

1. Schwefel, H.P.P. Evolution and Optimum Seeking: The Sixth Generation; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 1993.
2. Santner, T.J.; Williams, B.J.; Notz, W.I. The Design and Analysis of Computer Experiments; Springer Science & Business Media: Berlin, Germany, 2013.
3. Hurtado, J.; Barbat, A.H. Monte Carlo techniques in computational stochastic mechanics. Arch. Comput. Methods Eng. 1998, 5, 3–29.
4. Conti, S.; O’Hagan, A. Bayesian emulation of complex multi-output and dynamic computer models. J. Stat. Plan. Inference 2010, 140, 640–651.
5. Higdon, D.; Gattiker, J.; Williams, B.; Rightley, M. Computer model calibration using high-dimensional output. J. Am. Stat. Assoc. 2008, 103, 570–583.
6. Balci, O. Verification, validation, and certification of modeling and simulation applications. In Proceedings of the 35th Conference on Winter Simulation: Driving Innovation, New Orleans, LA, USA, 7–10 December 2003.
7. Rubino, G.; Tuffin, B. Rare Event Simulation Using Monte Carlo Methods; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 2009.
8. Sundar, V.; Shields, M.D. Surrogate-enhanced stochastic search algorithms to identify implicitly defined functions for reliability analysis. Struct. Saf. 2016, 62, 1–11.
9. Shan, S.; Wang, G.G. Survey of modeling and optimization strategies to solve high-dimensional design problems with computationally-expensive black-box functions. Struct. Multidiscip. Optim. 2010, 41, 219–241.
10. Fadale, T.D.; Nenarokomov, A.V.; Emery, A.F. Uncertainties in parameter estimation: The inverse problem. Int. J. Heat Mass Transf. 1995, 38, 511–518.
11. Liang, B.; Mahadevan, S. Error and uncertainty quantification and sensitivity analysis in mechanics computational models. Int. J. Uncertain. Quantif. 2011, 1, 147–161.
12. De Cursi, E.S.; Sampaio, R. Uncertainty Quantification and Stochastic Modeling with Matlab; Elsevier: Amsterdam, The Netherlands, 2015.
13. Friedman, J.H. Multivariate adaptive regression splines. Ann. Stat. 1991, 19, 1–67.
14. Drucker, H.; Burges, C.J.; Kaufman, L.; Smola, A.J.; Vapnik, V. Support vector regression machines. Adv. Neural Inf. Process. Syst. 1997, 9, 155–161.
15. Oparaji, U.; Sheu, R.J.; Bankhead, M.; Austin, J.; Patelli, E. Robust artificial neural network for reliability and sensitivity analyses of complex non-linear systems. Neural Netw. 2017, 96, 80–90.
16. Sun, Z.; Wang, J.; Li, R.; Tong, C. LIF: A new kriging based learning function and its application to structural reliability analysis. Reliab. Eng. Syst. Saf. 2017, 157, 152–165.
17. Ghanem, R.; Spanos, P.D. Stochastic Finite Elements: A Spectral Approach; Springer: Berlin, Germany, 1991.
18. Xiu, D.; Karniadakis, G.E. The Wiener–Askey polynomial chaos for stochastic differential equations. SIAM J. Sci. Comput. 2002, 24, 619–644.
19. Xiu, D.; Hesthaven, J.S. High-order collocation methods for differential equations with random inputs. SIAM J. Sci. Comput. 2005, 27, 1118–1139.
20. Xiu, D. Numerical Methods for Stochastic Computations: A Spectral Method Approach; Princeton University Press: Princeton, NJ, USA, 2010.
21. Le Maître, O.P.; Reagan, M.T.; Najm, H.N.; Ghanem, R.G.; Knio, O.M. A stochastic projection method for fluid flow: II. Random process. J. Comput. Phys. 2002, 181, 9–44.
22. Ghiocel, D.M.; Ghanem, R.G. Stochastic finite-element analysis of seismic soil-structure interaction. J. Eng. Mech. 2002, 128, 66–77.
23. Kennedy, M.C.; O’Hagan, A. Bayesian calibration of computer models. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 2001, 63, 425–464.
24. Cressie, N. Statistics for Spatial Data; Wiley Series in Probability and Statistics; Wiley-Interscience: New York, NY, USA, 1993.
25. MacKay, D.J. Introduction to Gaussian processes. NATO ASI Ser. F Comput. Syst. Sci. 1998, 168, 133–166.
26. Rasmussen, C.E. Gaussian processes in machine learning. In Advanced Lectures on Machine Learning; Springer: Berlin, Germany, 2004; pp. 63–71.
27. Constantine, P.G.; Wang, Q. Residual minimizing model interpolation for parameterized nonlinear dynamical systems. SIAM J. Sci. Comput. 2012, 34, A2118–A2144.
28. Quiñonero-Candela, J.; Rasmussen, C.E. A unifying view of sparse approximate Gaussian process regression. J. Mach. Learn. Res. 2005, 6, 1939–1959.
29. Schöbi, R.; Sudret, B.; Wiart, J. Polynomial-chaos-based kriging. Int. J. Uncertain. Quantif. 2015, 5, 171–193.
30. Schöbi, R.; Sudret, B.; Marelli, S. Rare event estimation using polynomial-chaos kriging. ASCE-ASME J. Risk Uncertain. Eng. Syst. Part A Civ. Eng. 2017, 3, D4016002.
31. Schöbi, R.; Sudret, B. Combining polynomial chaos expansions and kriging for solving structural reliability problems. In Proceedings of the 7th International Conference on Computational Stochastic Mechanics (CSM7), Santorini, Greece, 15–18 June 2014.
32. Schöbi, R.; Sudret, B. PC-kriging: A new meta-modeling method and its applications to quantile estimation. In Proceedings of the 17th IFIP Working Group 7.5 Conference on Reliability and Optimization of Structural Systems, Huangshan, China, 3–7 July 2014.
33. Aronszajn, N. Theory of reproducing kernels. Trans. Am. Math. Soc. 1950, 68, 337–404.
34. Kullback, S. Information Theory and Statistics; Courier Corporation: North Chelmsford, MA, USA, 1997.
35. Echard, B.; Gayton, N.; Lemaire, M. AK-MCS: An active learning reliability method combining kriging and Monte Carlo simulation. Struct. Saf. 2011, 33, 145–154.
36. Dubourg, V. Adaptive Surrogate Models for Reliability Analysis and Reliability-Based Design Optimization. Ph.D. Thesis, Université Blaise Pascal-Clermont-Ferrand II, Aubière, France, 2011.
37. Kibble, W. An extension of a theorem of Mehler’s on Hermite polynomials. Math. Proc. Camb. Philos. Soc. 1945, 41, 12–15.
38. Rasmussen, C.E.; Nickisch, H. Gaussian processes for machine learning (GPML) toolbox. J. Mach. Learn. Res. 2010, 11, 3011–3015.
39. Niederreiter, H. Quasi-Monte Carlo Methods; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 2010.
40. Ishigami, T.; Homma, T. An importance quantification technique in uncertainty analysis for computer models. In Proceedings of the First International Symposium on Uncertainty Modeling and Analysis, College Park, MD, USA, 3–5 December 1990; pp. 398–403.
41. Marrel, A.; Iooss, B.; Van Dorpe, F.; Volkova, E. An efficient methodology for modeling complex computer codes with Gaussian processes. Comput. Stat. Data Anal. 2008, 52, 4731–4744.
42. Oakley, J.E.; O’Hagan, A. Probabilistic sensitivity analysis of complex models: A Bayesian approach. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 2004, 66, 751–769.
Figure 1. Left: Generate the reproducing kernel Hilbert space (RKHS) with the reproducing kernel map; right: generate the reproducing Mercer kernels with the polynomial chaos expansion (PCE) basis.
Figure 2. Comparison of the effect of M and ρ on the 1D Mehler kernel with one fixed point {0.8}. (a) Kernel value of Me_M(x, 0.8) with ρ = 0.5; (b) kernel value of Me(x, 0.8).
Figure 3. Projections on the first 100 polynomials of f(x) and the Gaussian process on polynomial chaos basis (GPCB) with ρ = 0.1, 0.45, 0.7.
Figure 4. Comparisons among the GP, polynomial chaos expansion (PCE) and GPCB surrogates for the 1D example. (a) Comparison of the KL divergence, P = 10; (b) comparison of the ECDF of prediction, P = 10; (c) comparison of the RMSE with different degrees P in the 1D example.
Figure 5. Comparisons among the GP, PCE and GPCB surrogates for the 2D example. (a) Comparison of the KL divergence with p_t = 7; (b) comparison of the ECDF of prediction, p_t = 7; (c) comparison of the RMSE with different degrees p_t in the 2D example.
Figure 6. Left: X_c and X in 3D view; the blue dots represent the quadrature points, while the red points represent our samplings; middle: this shows the sparsity of our sampling in X_c; right: this shows that our sampling actually covered almost every feature of X_c. (a) X_c and X in 3D view; (b) one slice of X_c; (c) projection of X on X_c in dimensions one and two.
Figure 7. Comparison of the RMSE between four sampling strategies with the PCE method.
Figure 8. Comparisons among the GP, PCE and GPCB surrogates for the Ishigami function. (a) Comparison of the RMSE; (b) comparison of the ECDF; N = 1000.
Figure 9. Comparisons among the GP, PCE and GPCB surrogates for the G-function. (a) Comparison of the RMSE; (b) comparison of the ECDF; N = 1000.
Figure 10. Comparisons among the GP, PCE and GPCB surrogates for Equation (26). (a) Comparison of the RMSE; (b) comparison of the ECDF; N = 1000.
