Article

On a Robust MaxEnt Process Regression Model with Sample-Selection

Department of Statistics, Dongguk University-Seoul, Pil-Dong 3Ga, Chung-Gu, Seoul 100-715, Korea
* Author to whom correspondence should be addressed.
Entropy 2018, 20(4), 262; https://doi.org/10.3390/e20040262
Submission received: 2 February 2018 / Revised: 3 April 2018 / Accepted: 7 April 2018 / Published: 9 April 2018

Abstract

In a regression analysis, a sample-selection bias arises when a dependent variable is only partially observed as a result of the sample selection. This study introduces a Maximum Entropy (MaxEnt) process regression model that assumes a MaxEnt prior distribution for its nonparametric regression function and finds that the MaxEnt process regression model includes the well-known Gaussian process regression (GPR) model as a special case. This special MaxEnt process regression model, i.e., the GPR model, is then generalized to obtain a robust sample-selection Gaussian process regression (RSGPR) model that deals with non-normal data in the sample selection. Various properties of the RSGPR model are established, including the stochastic representation, distributional hierarchy, and magnitude of the sample-selection bias. These properties are used to develop a hierarchical Bayesian methodology for estimating the model. This involves a simple and computationally feasible Markov chain Monte Carlo algorithm that avoids analytical or numerical derivatives of the log-likelihood function of the model. The performance of the RSGPR model, in terms of sample-selection bias correction, robustness to non-normality, and prediction, is demonstrated through simulation results that attest to its good finite-sample performance.

1. Introduction

The Bayesian nonparametric method is a powerful approach for regression problems when the shape of the underlying regression function is unknown, the function may be difficult to evaluate analytically, or other requirements such as design costs complicate the process of information acquisition. Bayesian orthogonal basis expansion regression, spline smoothing regression, wavelet regression, and Gaussian process regression (GPR) are powerful nonparametric Bayesian approaches that address these regression problems. These regression techniques have been extensively used in fields such as psychology, data science, engineering, neuroscience, and fishery [1,2,3,4,5].
Sample selection (or incidental truncation) in regression analysis arises in a wide variety of practical problems, and a standard analysis of data with sample selection leads to biased results because the selected sample represents only a subset of the full population; see [6,7,8]. Regression analysis is also sensitive to outliers and to departures from normality of the dependent variable (see [9,10]). Thus, when one implements nonparametric Bayesian regression with non-normal data subject to sample selection, the selection mechanism and the non-normality of the data must be modeled jointly with the Bayesian nonparametric regression model in order to correct the sample-selection bias and to implement a robust statistical inference. In this regard, several estimation procedures have been considered in the literature to produce robust linear regression models subject to sample selection, including, for instance, [6,7] for frequentist methods and [8,9,11] for Bayesian methods. See [9,12] for robust Bayesian sample-selection models other than the regression model. However, to the best of our knowledge, no studies have generalized a nonparametric regression model to deal with non-normal data with sample selection.
The objective of this paper is to introduce the Maximum Entropy (MaxEnt) process regression model as a new Bayesian nonparametric regression model and then to generalize this model to propose a robust sample-selection Bayesian nonparametric regression model along with its inferential methodology. The MaxEnt process regression model is obtained by assuming a MaxEnt prior distribution for its nonparametric regression function, and it includes the GPR model as a special case. This provides a relationship between the MaxEnt nonparametric regression approach and the rationale for conducting a Gaussian regression analysis. This study focuses on the GPR model as a special MaxEnt process regression model and a powerful model for nonparametric regression problems. Then, the GPR model is generalized to obtain a robust sample-selection Gaussian process regression (RSGPR) model. This RSGPR model extends the GPR model to account for the sample-selection scheme, and it is robust when the data are heavy-tailed or contain outliers.
The RSGPR model consists of two components. The first is a robust GPR model that determines the level of the dependent variable of interest, and the second is an equation that describes the selection mechanism determining whether the dependent variable is observed. The sample-selection bias arises when these two components are correlated, so they must be modeled jointly. A Bayesian hierarchical methodology is developed here to estimate the RSGPR model. This methodology relies on a stochastic representation technique (see, e.g., [13]) to set up the Bayesian hierarchy of the RSGPR model, and it has three attractive features. First, given the likelihood function of the model, the posterior of its parameters does not belong to any well-known parametric family, but the methodology uses a simple Markov chain Monte Carlo (MCMC) algorithm that does not require sampling directly from this complex joint posterior. Second, the output of the algorithm not only provides a Bayesian analogue of confidence intervals for the regression function, but it also readily gives an indication of the presence (or absence) of the sample-selection bias. Third, if there is prior information, such as restrictions on the regression function, such information can be incorporated easily through a prior distribution.
The remainder of this study is organized as follows. Section 2 introduces the MaxEnt process regression model, which strictly includes the GPR model. Then, this section formulates the RSGPR model, obtained by equipping the GPR model with a class of scale mixtures of normal errors and coupling it with a selection model comprising a class of scale mixtures of probit sample-selection equations. Properties of this RSGPR model are studied, including the exact distribution of a selected observation, a stochastic representation, a distributional hierarchy, and the magnitude of the sample-selection bias. In Section 3, we construct a Bayesian hierarchical model for inference in the RSGPR model by exploiting the stochastic representation and distributional hierarchy. We then develop a Bayesian estimation methodology based on the hierarchical model to provide a simple estimation procedure for the RSGPR model. We further construct a computationally feasible MCMC algorithm through a Bayesian hierarchical approach. Section 4 examines the finite-sample performance of the method through a limited but informative simulation. This numerical illustration shows the usefulness of the RSGPR model for the Gaussian process regression analysis of non-normal data with sample selection. The study concludes with a discussion in Section 5. Proofs and additional details are provided in Appendix A.

2. Robust Sample-Selection GPR Model

2.1. MaxEnt Process Regression Model

Consider the following nonparametric regression model,
$$ y_n = \eta_n(x) + \epsilon_n, \qquad (1) $$
where $y_n = (y_1, \ldots, y_n)^\top$ is an $n \times 1$ vector of responses with $y_i = \eta(x_i) + \epsilon_i$, $\eta_n(x) = (\eta(x_1), \ldots, \eta(x_n))^\top$ is an $n \times 1$ vector of regression function values satisfying $\eta(x_i) = E[y_i \mid x_i]$, $i = 1, \ldots, n$, $x = (x_1, \ldots, x_n)^\top$ is the $n \times p$ design matrix, and $\epsilon_n = (\epsilon_1, \ldots, \epsilon_n)^\top$ is an $n \times 1$ vector of i.i.d. random noise terms with zero mean. In the basic model structure of (1), no parametric form of the regression function $\eta_n(x)$ is assumed, but $\eta_n(x)$ is assumed to have a specific type of functional structure. For example, $\eta_n(x)$ can be represented with a Fourier series [14], splines [15], kernels [16], or other bases.
In Bayesian nonparametric regression, we assume that the regression function (or signal term) $\eta_n(x)$ is a random function that follows a particular distribution. This distribution is subjective in the sense that it reflects our uncertain prior information regarding the function. Sometimes we have a situation in which partial prior information on $\eta_n(x)$ is available, outside of which it is desirable to use a prior that is as non-informative as possible. In this situation, Boltzmann's maximum entropy theorem (see, e.g., [17]) yields a maximum entropy prior $\pi_{\max}(\eta_n(x))$ that is of exponential form and maximizes the entropy,
$$ H(\pi) = -\int_{\mathbb{R}^n} \pi(\eta_n(x)) \log \pi(\eta_n(x)) \, d\eta_n(x), \qquad (2) $$
in the presence of partial information for various moment functions of η n ( x ) . In a special case where we only have partial prior information about the mean vector and covariance matrix functions of η n ( x ) of the Bayesian nonparametric regression model (1), Boltzmann’s maximum entropy theorem yields the following prior distribution of η n ( x ) .
Lemma 1.
Let n × 1 regression function vector η n ( x ) have a prior distribution on R n whose partial information on the mean and covariance functions are m ( x ) = ( m ( x 1 ) , , m ( x n ) ) and K ( x ) = { κ ( x i , x j ) } , respectively. Then the maximum entropy prior of η n ( x ) is
$$ \pi_{\max}(\eta_n(x)) = (2\pi)^{-n/2} |K(x)|^{-1/2} \exp\Big\{ -\tfrac{1}{2}\big(\eta_n(x) - m(x)\big)^\top K(x)^{-1} \big(\eta_n(x) - m(x)\big) \Big\} $$
for η n ( x ) R n . This is a density of G P ( m ( x ) , K ( x ) ) , a Gaussian process defined by the mean function m ( x ) and the covariance function K ( x ) .
Note that the Gaussian process $GP(m(x), K(x))$ defines a collection of random functions wherein any finite subset of the process has a multivariate normal (Gaussian) distribution. From now on, we will write the Gaussian process as $\eta_n(x) \sim GP(m(x), K(x))$. The only restriction on the Gaussian process is that the covariance function $K(x)$ must be an $n \times n$ positive definite symmetric (pds) matrix. If $K(x)$ is not a pds matrix, then the corresponding value of $H(\pi_{\max}) = n(1 + \log(2\pi))/2 + \log(|K(x)|)/2$ will not be defined (see, e.g., [18]). Building on this result, this paper introduces yet another Bayesian nonparametric regression model by combining the regression model (1) with the MaxEnt prior in Lemma 1 and a normal regression error distribution. The model is named the "MaxEnt process regression model" and is defined by
$$ y_n = \eta_n(x) + \epsilon_n, \quad \epsilon_n \sim N_n(0, \sigma^2 I_n), \quad \sigma^2 \sim IG(\nu_1, \nu_2), \quad \eta_n(x) \sim GP\big(m(x), K(x)\big), \qquad (3) $$
where N n ( 0 , σ 2 I n ) is an n-variate normal distribution with mean vector 0 and covariance matrix σ 2 I n , I G ( ν 1 , ν 2 ) denotes an inverse gamma distribution with shape parameter ν 1 and scale parameter ν 2 , η n ( x ) is independent of σ 2 and ϵ n , the mean function m ( x i ) reflects the expected function value at input x i , i.e., m ( x i ) = E [ η ( x i ) ] , and the covariance function κ ( x i , x j ) models the dependence between the function values at different input points x i and x j , i.e., κ ( x i , x j ) = E [ ( η ( x i ) m ( x i ) ) ( η ( x j ) m ( x j ) ) ] , i , j = 1 , , n . See [19] for choice of an appropriate covariance function based on assumptions such as smoothness and likely patterns to be expected in the data. A commonly used isotropic covariance function in practice is the squared exponential covariance function given by
$$ \kappa(x_i, x_j) = \mathrm{Cov}\big(\eta(x_i), \eta(x_j)\big) = u_0 \exp\Big\{ -\tfrac{w_0}{2}(x_i - x_j)^\top (x_i - x_j) \Big\}, \qquad (4) $$
where $u_0$ and $w_0$ are hyperparameters that govern the shape of the MaxEnt process regression: $u_0$ controls the global scale of the covariance matrix $K(x)$ and $w_0$ acts as a smoothing parameter.
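As an illustration, here is a minimal Python sketch of how the squared exponential covariance matrix $K(x)$ of Equation (4) can be evaluated; the function name and the hyperparameter values in the example are illustrative choices, not part of the paper.

```python
import numpy as np

def squared_exponential_cov(X, u0=1.0, w0=1.0):
    """Squared exponential covariance matrix K(x) of Equation (4).

    X  : (n, p) array of input points x_1, ..., x_n.
    u0 : global scale of the covariance matrix.
    w0 : smoothing (inverse length-scale) hyperparameter.
    """
    X = np.atleast_2d(X)
    # Pairwise squared Euclidean distances (x_i - x_j)'(x_i - x_j).
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return u0 * np.exp(-0.5 * w0 * sq_dists)

# Example: covariance matrix for 5 equally spaced scalar inputs.
x = np.linspace(0.0, 1.0, 5).reshape(-1, 1)
K = squared_exponential_cov(x, u0=1.0, w0=10.0)
print(K.shape)  # (5, 5), symmetric positive definite
```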
We can easily see that the MaxEnt process regression model (3) is the same as the GPR model considered by [19,20]. This proves the following corollary, which can hence be used as an information-theoretic justification for using the GPR model as a Bayesian approach to nonparametric regression analysis.
Corollary 1.
Suppose $E[\eta_n(x)] = m(x)$ and $\mathrm{Cov}(\eta_n(x)) = K(x)$ constitute all the prior information on $\eta_n(x)$ for the Bayesian nonparametric regression model (1). Then the MaxEnt prior distribution of $\eta_n(x)$ is $GP(m(x), K(x))$, which defines the GPR model.
According to Corollary 1, we shall denote the MaxEnt process regression model by the GPR model. When there is no functional constraint in the GPR model, the prior specification in the model (3) can be used, and posterior inference can be performed without difficulty. It is seen, from [20], that the conditional posterior distribution of $\eta_n(x)$ is normal with mean and covariance given by
$$ E[\eta_n(x) \mid (y_n, x), \sigma^2] = K(x)\big(K(x) + \sigma^2 I_n\big)^{-1} y_n + \sigma^2 \big(K(x) + \sigma^2 I_n\big)^{-1} m(x), $$
$$ \mathrm{Var}[\eta_n(x) \mid (y_n, x), \sigma^2] = \big(K(x)^{-1} + \sigma^{-2} I_n\big)^{-1} = \sigma^2 K(x)\big(K(x) + \sigma^2 I_n\big)^{-1}. \qquad (5) $$
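For concreteness, a minimal Python sketch of the conditional posterior mean and covariance in Equation (5); the function and argument names are illustrative and reuse the hypothetical covariance helper above.

```python
import numpy as np

def gpr_posterior(y, K, m, sigma2):
    """Conditional posterior mean and covariance of eta_n(x), Equation (5).

    y      : (n,) observed responses y_n.
    K      : (n, n) prior covariance matrix K(x).
    m      : (n,) prior mean vector m(x).
    sigma2 : noise variance sigma^2.
    """
    n = len(y)
    A = K + sigma2 * np.eye(n)            # K(x) + sigma^2 I_n
    A_inv = np.linalg.inv(A)
    post_mean = K @ A_inv @ y + sigma2 * A_inv @ m
    post_cov = sigma2 * K @ A_inv         # equals (K^{-1} + sigma^{-2} I_n)^{-1}
    return post_mean, post_cov
```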
However, in GPR analysis, a sample-selection scheme is often applied to the response variable, resulting in missing not at random (MNAR) observations on that variable. In this case, a regression analysis that uses only the selected cases will lead to biased results (see, e.g., [6,7,8]). This study provides a bias-correction procedure for the GPR analysis of MNAR data generated from the sample-selection scheme. For the analysis, we propose a robust sample-selection GPR (RSGPR) model based on a family of scale mixtures of normal (SMN) distributions (see [21,22] for details). This approach reflects the MNAR mechanism and provides robustness against departures from the normality assumption (see, e.g., [11,12]), yielding a robust GPR model for analyzing the partially observed sample-selection data.

2.2. Proposed Model

We propose the RSGPR model through the following steps. First, we modify the GPR model (3) by incorporating the SMN error distribution for a robust GPR analysis. Then, we connect the robust GPR model directly to a sample-selection model by introducing some latent variables to explain the partially observed sample-selection data. To model the sample-selection mechanism, we need to introduce some notation for the partially observed data.
Let $s_i$ be a binary variable that takes the value 1 if $y_i$ of subject $i$ is observed under the sample-selection scheme, and 0 otherwise. Then, we introduce the following RSGPR model to represent the regression equation of the observable variable $y_i$:
$$ y_i = \begin{cases} \eta(x_i) + \epsilon_i & \text{for } s_i = I(z_i \ge 0), \\ \text{missing} & \text{for } s_i = I(z_i < 0), \end{cases} \qquad z_i = v_i^\top \gamma + \varepsilon_i, \quad i = 1, \ldots, n, $$
$$ \begin{pmatrix} \epsilon_i \\ \varepsilon_i \end{pmatrix} \overset{iid}{\sim} SMN_2(0, \Sigma, \delta, G), \qquad \eta_n(x) \sim GP\big(m(x), K(x)\big), \qquad (6) $$
where $I(\cdot)$ is an indicator function and $SMN_2(0, \Sigma, \delta, G)$ denotes a scale mixture of bivariate normal distributions with mixing function $\delta(\omega)$ and mixing variable $\omega \sim G$. Here $\gamma = (\gamma_1, \ldots, \gamma_q)^\top$ and
$$ \Sigma = \begin{pmatrix} \sigma^2 & \rho\sigma \\ \rho\sigma & 1 \end{pmatrix} $$
are parameters to be elicited by using the priors $p_0(\gamma)$ and $p_0(\Sigma)$.
Without loss of generality, we assume that the single sample-selection scheme $s_i = I(z_i \ge 0)$ is applied to a random sample of $n$ observations (the $y_i$'s) associated with the model (1), and that it yields only the first $n_1$ observed values of the $y_i$'s out of the $n$ ($n > n_1$) observations. Thus, the overall available data information for the RSGPR model consists of the set of $s_i$ binary values, the $n_1$ tuples of observations $(y_i, v_i)$ corresponding to individuals with $s_i = 1$, and only $v_i$ for those with $s_i = 0$. The purpose of this study is to estimate the regression function $\eta_n(x)$ based on the partially observed data (i.e., sample-selection data) of size $n_1$.
For fixed η ( x i ) , the density of the RSGPR model (6) is composed of a continuous component h ( y i | s i = 1 ) and a discrete component p ( s i ) . The discrete component is
$$ p(s_i) = \bar{F}(C_i; 0, 1)^{s_i}\,\big[\, 1 - \bar{F}(C_i; 0, 1) \,\big]^{1 - s_i}, \qquad (7) $$
where $\bar{F}(C_i; d, \tau) = \int_0^\infty \bar{\Phi}(C_i; d, \delta(\omega)\tau)\, dG(\omega)$ with selection interval $C_i = (\alpha_i, \infty)$, $\alpha_i = -v_i^\top\gamma$, and $\bar{\Phi}(C_i; d, \delta(\omega)\tau) = \int_{C_i} \phi(x; d, \delta(\omega)\tau)\, dx$ denotes the probability of the interval $C_i$ under the $N(d, \delta(\omega)\tau)$ distribution with density $\phi(x; d, \delta(\omega)\tau)$. The continuous component is the density of $[y_i \mid \eta(x_i), s_i = 1] \overset{d}{=} [y_i \mid \eta(x_i), \varepsilon_i \in C_i]$ for $i = 1, \ldots, n_1$. This density is given by
$$ h(y_i \mid s_i = 1) = \frac{\displaystyle\int_0^\infty \phi\big(y_i;\, \eta(x_i),\, \delta(\omega)\sigma^2\big)\, \bar{\Phi}\big(C_i;\, \theta_{\varepsilon_i|y_i},\, \delta(\omega)\sigma^2_{\varepsilon_i|y_i}\big)\, dG(\omega)}{\bar{F}(C_i; 0, 1)}, \quad y_i \in \mathbb{R}, \qquad (8) $$
where $\theta_{\varepsilon_i|y_i} = \rho\,[y_i - \eta(x_i)]/\sigma$ and $\sigma^2_{\varepsilon_i|y_i} = 1 - \rho^2$. This distribution is essentially a member of the class of skew-scale mixtures of normal (skew-SMN) distributions discussed by [13,23,24]. We will denote the distribution law of $[y_i \mid \eta(x_i), s_i = 1]$ with density (8) by skew-$SMN(C_i; \theta_i, \Sigma, \delta, G)$, where $\theta_i = (\eta(x_i), 0)^\top$. The following lemma is useful for generating the partially observed $y_i$'s and indicates the difference between the RSGPR model (6) and the GPR model (3).
Lemma 2.
For a given value of $\eta(x_i)$, the selected observation $[y_i \mid \eta(x_i), s_i = 1]$ for the RSGPR model can be represented by the following two-stage distributional hierarchy:
$$ [y_i \mid \omega, \eta(x_i), s_i = 1] \overset{d}{=} \eta(x_i) + \rho\sigma Z_{C_i} + \sigma(1 - \rho^2)^{1/2} U_i, \quad i = 1, \ldots, n_1, \qquad \omega \sim G(\omega), $$
where $U_i \overset{iid}{\sim} N(0, \delta(\omega))$ and $Z_{C_i} \overset{ind}{\sim} TN_{C_i}(0, \delta(\omega))$ are conditionally independent given $\omega$, and $TN_{C_i}(0, \delta(\omega))$ denotes the $N(0, \delta(\omega))$ distribution truncated to the interval $C_i = (\alpha_i, \infty)$.
Lemma 2 shows that the RSGPR model serves both to relax the classic assumption of underlying normality and to reflect the sample-selection scheme. This lemma also indicates that the partially observed data (the $y_i$'s) do not represent a random sample from the GPR model generating the $y_i$'s, even after controlling for the regression function $\eta(x_i)$. If we want to apply a GPR analysis to the partially observed sample-selection data, the fitted model should be the RSGPR model. The RSGPR model changes depending on the choice of the distribution of $\omega$ and its function $\delta(\omega)$. In the special case where the distribution of $\omega$ degenerates at $\delta(\omega) = 1$, the RSGPR model produces a sample-selection Gaussian process normal error regression (SGPR_N) model. When we choose $\omega \sim G(\nu/2, \nu/2)$, a gamma distribution with mean 1, and $\delta(\omega_i) = 1/\omega_i$, the model becomes a sample-selection Gaussian process $t_\nu$ error regression (SGPR_$t_\nu$) model, which allows the tail behavior of the model to be regulated by means of the degrees of freedom. We also see that the RSGPR model strictly includes the GPR model, because the latter is obtained by setting $\rho = 0$. For the remainder of this study, we use the symbols in the preceding sections with the same definitions.
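To make the two-stage hierarchy of Lemma 2 concrete, here is a minimal Python sketch that draws one selected observation $[y_i \mid \eta(x_i), s_i = 1]$ for the SGPR_$t_\nu$ special case, i.e., $\delta(\omega) = 1/\omega$ with $\omega \sim G(\nu/2, \nu/2)$; the function and argument names are illustrative, not from the paper.

```python
import numpy as np
from scipy import stats

def draw_selected_y(eta_i, alpha_i, rho, sigma, nu, rng=np.random.default_rng()):
    """Draw one [y_i | eta(x_i), s_i = 1] via the two-stage hierarchy of
    Lemma 2, for the t_nu case: delta(omega) = 1/omega, omega ~ G(nu/2, nu/2)."""
    omega = rng.gamma(shape=nu / 2.0, scale=2.0 / nu)   # mixing variable, mean 1
    delta = 1.0 / omega                                 # delta(omega)
    # Z_{C_i}: N(0, delta) truncated to the selection interval (alpha_i, inf).
    a = alpha_i / np.sqrt(delta)                        # standardized lower bound
    z_c = stats.truncnorm.rvs(a, np.inf, loc=0.0, scale=np.sqrt(delta),
                              random_state=rng)
    u = rng.normal(0.0, np.sqrt(delta))                 # U_i ~ N(0, delta)
    return eta_i + rho * sigma * z_c + sigma * np.sqrt(1.0 - rho ** 2) * u
```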

2.3. The Sample-Selection Bias

As indicated by the density (8) and Lemma 2, the selected observations $[y_i \mid s_i = 1]$ do not represent a random sample from the GPR model generating the $y_i$'s; rather, they are missing not at random (MNAR) [25], which induces a sample-selection bias. The following results on the sample-selection bias should be noted for the Bayesian estimation of the GPR model with the partially observed data.
Lemma 3.
Given the RSGPR model (6), a stochastic representation of the conditional posterior distribution of the regression function $\eta_{n_1} = (\eta(x_1), \ldots, \eta(x_{n_1}))^\top$ is
$$ [\eta_{n_1} \mid y, \omega, \Psi] \overset{d}{=} \theta_1 + \Gamma\,\Omega_2^{-1}\, W_1^{C_\beta} + W_2, \qquad (10) $$
where Ψ = { σ 2 , ρ , γ } , W 1 N n 1 ( 0 , Ω 2 ) and W 2 N n 1 ( 0 , Ω 1 Γ Ω 2 1 Γ ) are independent random vectors, W 1 C β = d [ W 1 | W 1 C β ] , C β = i = 1 n 1 { w 1 i ; β i w 1 i } , W 1 = ( w 11 , , w 1 n 1 ) , β i = ( α i θ 2 i ) / δ ( ω ) , θ 1 = K 11 ( x ) 1 H 1 y n 1 + δ ( ω ) σ 2 H 1 m 1 ( x ) , θ 2 = ( θ 21 , , θ 2 n 1 ) = ρ ( y n 1 m 1 ( x ) ) / σ , Γ = ρ Ω 1 / σ , H = K 11 ( x ) + δ ( ω ) σ 2 I n 1 , Ω 1 = δ ( ω ) σ 2 K 11 ( x ) H 1 , Ω 2 = ( 1 ρ 2 ) I n 1 + ρ 2 Ω 1 / σ 2 , y n 1 = ( y 1 , , y n 1 ) be an n 1 × 1 observed vector, m 1 ( x ) = ( m ( x 1 ) , , m ( x n 1 ) ) , and K 11 ( x ) is the first n 1 × n 1 diagonal sub-matrix of K ( x ) .
As shown in Lemma 3, if we use the partially observed $y_i$'s to estimate the GPR model, the presence of the truncated normal term (i.e., $W_1^{C_\beta}$) in Equation (10) induces a biased estimation of the regression function. Note that this distribution becomes normal (i.e., $W_1 \sim N_{n_1}(0, \Omega_2)$) in the GPR model for the case where the $y_i$'s are fully observed. Therefore, the usual estimation of the regression function based on the GPR model will produce inconsistent results when $\rho \neq 0$. This clearly reveals that a sample-selection bias occurs in the Bayes estimation of the regression function $\eta_{n_1}$. The magnitude of this bias is as follows.
Corollary 2.
If the GPR model is used instead of the SGPR_N model to estimate $\eta_{n_1}$ based on the observed data $y_{n_1}$, then a sample-selection bias occurs in its conditional posterior mean. This bias is
E [ η n 1 | y n 1 , Ψ ] E [ η n 1 | y n 1 , σ 2 ] = ρ σ I n 1 + σ ( 1 ρ 2 ) ρ Ω 1 * 1 1 ξ ,
where ξ = E [ W 1 C β ] = ( ξ 1 , , ξ n 1 ) , ξ i = ω i i * ϕ ( β i ; 0 , ω i i * ) / [ 1 Φ ( β i / ω i i * ) ] , ω i i * denotes the i-th diagonal element of Ω 2 * , Ω 1 * = Ω 1 | δ ( ω ) = 1 , and Ω 2 * = Ω 2 | δ ( ω ) = 1 .
The sample-selection bias in calculating the marginal effect (or propensity) of a predictor can be also expected.
Corollary 3.
Suppose that $v_{ki} = x_{ki}$, where $v_{ki}$ and $x_{ki}$ are the $k$-th elements of $v_i$ and $x_i$, respectively. Then the difference in the marginal effect of the predictor $x_{ki}$ on the selected observation $y_i$ between the RSGPR model and the GPR model is
$$ \gamma_k\, \rho\sigma\, E_\omega\!\left[ \frac{1}{\delta(\omega)} \Big\{ \delta_2(v_i^\top\gamma, \omega) - \delta_1(v_i^\top\gamma, \omega)^2 \Big\} \right], \qquad (11) $$
where $\gamma_k$ is the $k$-th element of $\gamma$,
$$ \delta_1(v_i^\top\gamma, \omega) = \frac{\delta(\omega)\,\phi\big(\alpha_i; 0, \delta(\omega)\big)}{1 - \Phi\big(\alpha_i/\sqrt{\delta(\omega)}\big)}, \qquad \delta_2(v_i^\top\gamma, \omega) = \frac{\alpha_i\, \delta(\omega)\,\phi\big(\alpha_i; 0, \delta(\omega)\big)}{1 - \Phi\big(\alpha_i/\sqrt{\delta(\omega)}\big)}, $$
$\alpha_i = -v_i^\top\gamma$, and $E_\omega$ denotes that the expectation is taken with respect to the distribution of $\omega \sim G(\omega)$.
To compare the SGPR_N model with the GPR model, various values of the sample-selection bias associated with the posterior mean (see Corollary 2) and of the difference in the marginal effect of the $k$-th predictor (see Corollary 3) were calculated and are depicted in Figure 1. For the calculation, we set $\sigma = 1$, $\gamma_k = 1$, and $K_{11}(x) = 0.5\, I_{n_1} + 0.5\, \mathbf{1}_{n_1}\mathbf{1}_{n_1}^\top/n_1$, an intra-class covariance matrix, where $\mathbf{1}_{n_1}$ is an $n_1 \times 1$ summing vector whose elements are all one. The left panel in Figure 1 is a graph of the sample-selection bias for different values of $\beta_i$ and $\rho$; it shows the values of the first element of the bias vector given in Corollary 2. From the graph, we see that the sample-selection bias occurs in a GPR analysis with sample selection, and that its magnitude becomes larger as $|\rho|$ or $\beta_i$ becomes larger. The sign of the bias is opposite to that of $\rho$. The right panel shows a graph of the difference in the marginal effect (defined by Equation (11)) as a function of $\alpha_i$ and $\rho$. This graph shows that the absolute value of the difference increases rapidly as $\alpha_i$ becomes large, and that this difference tends to be larger as $|\rho|$ becomes larger. Furthermore, the signs of the difference and of $\rho$ are opposite, which is expected for the case where $\gamma_k > 0$. These panels imply that an inconsistent nonparametric regression analysis is unavoidable whenever the GPR model is fitted to partially observed sample-selection data. Instead, the proposed RSGPR model should be used to correct the sample-selection bias and to estimate the true marginal effect of each predictor in the regression analysis.
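As a rough check on the right panel of Figure 1, here is a minimal Python sketch of the marginal-effect difference in Equation (11) for the normal case $\delta(\omega) = 1$, so that the expectation over $\omega$ drops out; the forms of $\delta_1$ and $\delta_2$ follow the reconstruction above, and the grid values are illustrative.

```python
import numpy as np
from scipy.stats import norm

def marginal_effect_difference(alpha, rho, sigma=1.0, gamma_k=1.0):
    """Difference in the marginal effect of the k-th predictor between the
    SGPR_N and GPR models, Equation (11) with delta(omega) = 1:
    gamma_k * rho * sigma * (delta_2 - delta_1^2), where
    delta_1 = phi(alpha)/(1 - Phi(alpha)) and delta_2 = alpha * delta_1."""
    lam = norm.pdf(alpha) / norm.sf(alpha)   # inverse Mills ratio (delta_1)
    return gamma_k * rho * sigma * (alpha * lam - lam ** 2)

# Illustrative grid, mirroring the settings sigma = 1 and gamma_k = 1 of Figure 1.
alphas = np.linspace(-2.0, 2.0, 9)
for rho in (-0.9, -0.5, 0.5, 0.9):
    print(rho, np.round(marginal_effect_difference(alphas, rho), 3))
```

For $\gamma_k > 0$ the computed difference always has the sign opposite to $\rho$, in line with the description of the right panel of Figure 1.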

3. Bayesian Hierarchical Methodology

3.1. Hierarchical Representation of the RSGPR Model

Let us revisit the RSGPR model (6) in Section 2.2. From Equations (7) and (8), we see that the log-likelihood function of the RSGPR model based on the partially observed n-tuples of observations ( y i , x i , v i , s i ) is
$$ \ell(\eta_{n_1}, \gamma, \rho, \sigma^2) = \sum_{i=1}^{n} \Big\{ s_i \big[ \ln \bar{F}(C_i; 0, 1) + \ln h(y_i \mid s_i = 1) \big] + (1 - s_i) \ln\big[ 1 - \bar{F}(C_i; 0, 1) \big] \Big\}. \qquad (12) $$
This is a complex function for the Bayesian estimation of the parameters ( η n 1 and Ψ ) of the RSGPR model. Instead, the following hierarchical representation of the RSGPR model is useful for a simple estimation of the parameters.
First, the likelihood function in Equation (12) can be represented by the following distributional hierarchy.
Theorem 1.
For the n-pairs of independent observations, ( y i , s i ) , generated from the RSGPR model defined by Equation (6), their distribution can be written by the following Bayesian hierarchical model:
$$
\begin{aligned}
y_i \mid \omega_i, z_{C_i}, s_i = 1 &\sim N\big(\eta(x_i) + \zeta z_{C_i},\; \delta(\omega_i)\tau^2\big), \\
p(s_i \mid z_i, \omega_i) &= I(z_i \ge 0)\, I(s_i = 1) + I(z_i < 0)\, I(s_i = 0), \\
z_i \mid \omega_i &\sim N\big(v_i^\top\gamma,\; \delta(\omega_i)\big), \qquad \omega_i \sim G(\omega_i), \quad i = 1, \ldots, n, \\
\eta_{n_1} &\sim N_{n_1}\big(m_1(x),\; K_{11}(x)\big), \\
\zeta \mid \tau^2 &\sim N(\theta_0,\; \sigma_0\tau^2), \qquad \tau^2 \sim IG(c, d), \qquad \gamma \sim N_q(\gamma_0,\; \Omega_0),
\end{aligned}
$$
where $z_{C_i} = z_i - v_i^\top\gamma$, $\zeta = \rho\sigma$, $\tau^2 = \sigma^2(1 - \rho^2)$, $IG(c, d)$ denotes an inverse gamma distribution with p.d.f. $IG(\tau^2; c, d) = d^c\, \tau^{-2(c+1)}\, e^{-d/\tau^2}/\Gamma(c)$, and $G(\cdot)$ is the distribution function of the scale mixing variable $\omega$.
When prior information on $\zeta$, $\tau^2$, and $\gamma$ is not available, a convenient strategy for avoiding an improper posterior distribution is to use proper priors with their hyperparameters fixed at values that reflect the diffuseness of the priors (i.e., limiting non-informative priors). For this purpose, the prior distributions in Theorem 1 are used to elicit the prior distributions of $\zeta$, $\tau^2$, and $\gamma$. All hyperparameters appearing in the prior distributions of the Bayesian hierarchical model are assumed to be given from the prior information of previous studies or other sources.

3.2. Full Conditional Posteriors

Let y n 1 = ( y 1 , , y n 1 ) , V = ( v 1 , , v n ) , and s = ( s 1 , , s n ) be observed. Further suppose that z = ( z 1 , , z n ) and ω = ( ω 1 , , ω n ) are the latent observation vector and the scale mixing vector, respectively. Then, based on the RSGPR model, we obtained joint posterior distribution of Θ = { η n 1 , τ 2 , ζ , γ , z , ω } given the observed data set D n = { y n 1 , V , s } :
$$ p(\Theta \mid D_n) \propto \prod_{i=1}^{n_1} \phi\big(y_i;\, \eta(x_i) + \zeta z_{C_i},\, \delta(\omega_i)\tau^2\big) \times \prod_{i=1}^{n} p(s_i \mid z_i, \omega_i)\, \phi\big(z_i;\, v_i^\top\gamma,\, \delta(\omega_i)\big)\, g(\omega_i) $$
$$ \qquad \times\; IG(\tau^2; c, d)\, \phi(\zeta;\, \theta_0, \sigma_0\tau^2)\, \phi_{n_1}\big(\eta_{n_1};\, m_1(x), K_{11}(x)\big)\, \phi_q(\gamma;\, \gamma_0, \Omega_0), \qquad (13) $$
where g ( · ) is the p.d.f. of the scale mixing variable ω . Note that the joint posterior in (13) is not simplified in an analytic form of the known density and is thus intractable for posterior inference. Instead, we derive conditional posterior distribution of each parameter in Θ in an explicit form, which will be useful for posterior inference by using a Markov chain Monte Carlo (MCMC) method.
Given the joint posterior distribution (13), we can obtain the following posterior distributions whose derivations are provided in Appendix A:
(1)
The full conditional posterior distribution of η n 1 is given by
$$ \eta_{n_1} \mid \Theta_{-\eta_{n_1}}, D_n \sim N_{n_1}\big(\theta_{\eta_{n_1}}, \Sigma_{\eta_{n_1}}\big), \qquad (14) $$
where $\theta_{\eta_{n_1}} = \Sigma_{\eta_{n_1}}\big[ K_{11}(x)^{-1} m_1(x) + D_1(\delta(\omega))^{-1}(y_{n_1} - \zeta z_C)/\tau^2 \big]$, $\Sigma_{\eta_{n_1}} = \big[ K_{11}(x)^{-1} + D_1(\delta(\omega))^{-1}/\tau^2 \big]^{-1}$, $D_1(\delta(\omega)) = \mathrm{diag}\{\delta(\omega_1), \ldots, \delta(\omega_{n_1})\}$, $z_C = (z_{C_1}, \ldots, z_{C_{n_1}})^\top$, and $z_{C_i} = z_i - v_i^\top\gamma$.
(2)
The full conditional posterior distribution of τ 2 is an inverse Gamma distribution:
$$ \tau^2 \mid \Theta_{-\tau^2}, D_n \sim IG\!\left( c + \frac{n_1 + 1}{2},\;\; d + \frac{1}{2}\sum_{i=1}^{n_1} \frac{(y_i - \eta(x_i) - \zeta z_{C_i})^2}{\delta(\omega_i)} + \frac{(\zeta - \theta_0)^2}{2\sigma_0} \right). \qquad (15) $$
(3)
The full conditional posterior distribution of ζ is a normal distribution:
$$ \zeta \mid \Theta_{-\zeta}, D_n \sim N(\theta_\zeta, \sigma_\zeta^2), \qquad (16) $$
where
$$ \theta_\zeta = \frac{\theta_0/\sigma_0 + \sum_{i=1}^{n_1} (y_i - \eta(x_i))\, z_{C_i}/\delta(\omega_i)}{1/\sigma_0 + \sum_{i=1}^{n_1} z_{C_i}^2/\delta(\omega_i)} \quad \text{and} \quad \sigma_\zeta^2 = \left( \frac{1}{\sigma_0\tau^2} + \sum_{i=1}^{n_1} \frac{z_{C_i}^2}{\delta(\omega_i)\tau^2} \right)^{-1}. $$
(4)
The full conditional posterior distributions of z i ’s are independent and their distributions are given by
$$ z_i \mid \Theta_{-z_i}, D_n \overset{ind}{\sim} \begin{cases} TN_{(-\infty, 0)}\big(v_i^\top\gamma,\; \delta(\omega_i)\big) & \text{if } s_i = 0, \\[1ex] TN_{(0, \infty)}\big(\theta_{z_i},\; \sigma_{z_i}^2\big) & \text{if } s_i = 1, \end{cases} \qquad (17) $$
for $i = 1, \ldots, n$, where
$$ \theta_{z_i} = v_i^\top\gamma + \frac{\zeta\,(y_i - \eta(x_i))}{\zeta^2 + \tau^2} \quad \text{and} \quad \sigma_{z_i}^2 = \frac{\delta(\omega_i)\tau^2}{\zeta^2 + \tau^2}. $$
(5)
The full conditional posterior density of γ is:
$$ \gamma \mid \Theta_{-\gamma}, D_n \sim N_q(\theta_\gamma, \Sigma_\gamma), \qquad (18) $$
where $\theta_\gamma = \Sigma_\gamma\Big[ \sum_{i=1}^{n} \frac{z_i v_i}{\delta(\omega_i)} + \sum_{i=1}^{n_1} \frac{\zeta}{\tau^2\delta(\omega_i)}\big(\zeta z_i + \eta(x_i) - y_i\big) v_i + \Omega_0^{-1}\gamma_0 \Big]$ and $\Sigma_\gamma = \Big( \Omega_0^{-1} + \sum_{i=1}^{n} \frac{v_i v_i^\top}{\delta(\omega_i)} + \sum_{i=1}^{n_1} \frac{\zeta^2}{\tau^2\delta(\omega_i)}\, v_i v_i^\top \Big)^{-1}$.
(6)
The full conditional posterior densities of ω i ’s are independent and they are given by
$$ p(\omega_i \mid \Theta_{-\omega_i}, D_n) \propto \phi\big(y_i;\, \eta(x_i) + \zeta z_{C_i},\, \delta(\omega_i)\tau^2\big)\, \phi\big(z_i;\, v_i^\top\gamma,\, \delta(\omega_i)\big)\, g(\omega_i)\, I(i \le n_1) + \phi\big(z_i;\, v_i^\top\gamma,\, \delta(\omega_i)\big)\, g(\omega_i)\, I(i > n_1). \qquad (19) $$

3.3. Markov Chain Monte Carlo Method

The MCMC scheme, working with the full conditional distributions of the parameters in $\Theta$, is straightforward to implement. A routine Gibbs sampler can be used to generate posterior samples of $\eta_{n_1}$, $\tau^2$, $\zeta$, $z_i$, and $\gamma$ from their respective full conditional posterior distributions obtained in Section 3.2 (a minimal one-sweep sketch for the SGPR_N case is given after the remarks below). In the posterior sampling of the $\omega_i$'s, a Metropolis–Hastings (M–H) step within the Gibbs algorithm can be applied, because their conditional posterior densities may not have the explicit form of a known distribution, as in Equation (19). For Gibbs sampling, one should note the following points:
(1)
Given the initial values of Θ ( 0 ) , the implementation of the Gibbs sampler involves R iterative sampling from each of the full conditional posterior distributions obtained in Equation (14) through Equation (19).
(2)
Gibbs samples of ρ and σ 2 can be obtained by using those of ζ = ρ σ and τ 2 = σ 2 ( 1 ρ 2 ) .
(3)
If ω i degenerates at δ ( ω i ) = 1 , the RSGPR model can be reduced to the SGPR N model. In this case, the MCMC procedure excludes the Gibbs sampling of ω i ’s by using the posterior distribution (19).
(4)
The RSGPR model accommodates various distributions of the mixing variable $\omega_i$ and mixing functions $\delta(\omega_i)$ of the SMN family, such as the $t_\nu$, logit, stable, slash, and exponential power models (see, e.g., [21,22]).
(5)
When ω i i i d G ( ν / 2 , ν / 2 ) and δ ( ω i ) = 1 / ω i , the RSGPR model becomes the SGPR t ν model. For generating ω i ’s, we may use the following posteriors
$$ \omega_i \overset{ind}{\sim} \begin{cases} G\!\left( \dfrac{\nu + 2}{2},\;\; \dfrac{\nu + z_{C_i}^2}{2} + \dfrac{(y_i - \eta(x_i) - \zeta z_{C_i})^2}{2\tau^2} \right) & \text{for } i \le n_1, \\[2ex] G\!\left( \dfrac{\nu + 1}{2},\;\; \dfrac{\nu + z_{C_i}^2}{2} \right) & \text{for } i > n_1, \end{cases} $$
where $z_{C_i} = z_i - v_i^\top\gamma$. Except for the SGPR_N and SGPR_$t_\nu$ cases, we need to adopt the Metropolis–Hastings algorithm within the Gibbs sampler because the conditional posterior density of $\omega_i$ does not have the explicit form of a known distribution. See [26,27] for the algorithm for sampling $\omega_i$ from the posterior density.
(6)
When the squared exponential covariance function $K(x)$ in Equation (4) is chosen with unknown hyperparameters $u_0$ and $w_0$, we need to elicit priors for $u_0$ and $w_0$ for the full Bayes method based on MCMC. The priors considered by [28] can be used for this assessment as follows: a conjugate $u_0 \sim IG(a, b)$ and $w_0 \sim HC(c, d)$. Here $HC(c, d)$ denotes the half-Cauchy distribution with p.d.f. $HC(w_0; c, d)$, location parameter $c$, and scale parameter $d$. See [28] for the use of $w_0 \sim HC(c, d)$ to elicit the prior information on $w_0$.
(7)
Full conditional posterior distributions of u 0 and w 0 are
$$ [u_0 \mid \Theta, w_0, D] \sim IG(a^*, b^*) \quad \text{and} \quad p(w_0 \mid \Theta, u_0, D) \propto \phi_{n_1}\big(\eta_{n_1};\, m_1(x), K_{11}(x)\big)\, HC(w_0; c, d), $$
where $a^* = a + n_1/2$ and $b^* = b + u_0\,(\eta_{n_1} - m_1(x))^\top K_{11}(x)^{-1}(\eta_{n_1} - m_1(x))$. Note that the conditional posterior density of $w_0$ does not have the explicit form of a known distribution. This implies the use of the Metropolis–Hastings algorithm within the Gibbs sampler to generate $w_0$ from the posterior density.
(8)
After obtaining the Gibbs samples of $\Theta$, we can use them for Monte Carlo estimation of the regression function $\eta_{n_2}$ and the missing observations $y_{n_2}$. They can also be used for predicting regression function values and $y_i$'s evaluated at new predictors (see, e.g., [26]).
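As an illustration of the sampler referred to above, here is a minimal Python sketch of one sweep of the Gibbs sampler for the SGPR_N special case ($\delta(\omega_i) = 1$, so the $\omega_i$ step (19) is not needed), following the full conditionals (14)–(18). The function, its argument layout, and the convention that the selected cases occupy the first $n_1$ rows are illustrative assumptions, not the authors' R implementation.

```python
import numpy as np
from scipy import stats

def gibbs_sweep(state, y1, V, s, K11_inv, m1, hyper, rng):
    """One Gibbs sweep for the SGPR_N model (delta(omega) = 1), using the full
    conditionals (14)-(18).  `state` holds (eta, tau2, zeta, z, gamma); `hyper`
    holds (theta0, sigma0, c, d, gamma0, Omega0_inv).  Assumes the data are
    ordered so that the n1 selected cases (s_i = 1) come first."""
    eta, tau2, zeta, z, gamma = state
    theta0, sigma0, c, d, gamma0, Omega0_inv = hyper
    n1, n = len(y1), len(s)
    sel = np.arange(n1)                          # indices with s_i = 1
    zc = z - V @ gamma                           # z_{C_i} = z_i - v_i' gamma

    # (14) eta_{n1} | rest
    Sig_eta = np.linalg.inv(K11_inv + np.eye(n1) / tau2)
    th_eta = Sig_eta @ (K11_inv @ m1 + (y1 - zeta * zc[sel]) / tau2)
    eta = rng.multivariate_normal(th_eta, Sig_eta)

    # (15) tau^2 | rest  (inverse-gamma draw via reciprocal of a gamma draw)
    resid = y1 - eta - zeta * zc[sel]
    shape = c + (n1 + 1) / 2.0
    rate = d + 0.5 * np.sum(resid ** 2) + (zeta - theta0) ** 2 / (2.0 * sigma0)
    tau2 = 1.0 / rng.gamma(shape, 1.0 / rate)

    # (16) zeta | rest
    sig2_z = 1.0 / (1.0 / (sigma0 * tau2) + np.sum(zc[sel] ** 2) / tau2)
    th_z = (theta0 / sigma0 + np.sum((y1 - eta) * zc[sel])) / \
           (1.0 / sigma0 + np.sum(zc[sel] ** 2))
    zeta = rng.normal(th_z, np.sqrt(sig2_z))

    # (17) latent z_i | rest : truncated normal draws
    mu_probit = V @ gamma
    for i in range(n):
        if s[i] == 1:
            mu_i = mu_probit[i] + zeta * (y1[i] - eta[i]) / (zeta ** 2 + tau2)
            sd_i = np.sqrt(tau2 / (zeta ** 2 + tau2))
            z[i] = stats.truncnorm.rvs((0.0 - mu_i) / sd_i, np.inf,
                                       loc=mu_i, scale=sd_i, random_state=rng)
        else:
            z[i] = stats.truncnorm.rvs(-np.inf, 0.0 - mu_probit[i],
                                       loc=mu_probit[i], scale=1.0,
                                       random_state=rng)

    # (18) gamma | rest
    V1 = V[sel]
    Sig_g = np.linalg.inv(Omega0_inv + V.T @ V + (zeta ** 2 / tau2) * V1.T @ V1)
    th_g = Sig_g @ (V.T @ z + (zeta / tau2) * V1.T @ (zeta * z[sel] + eta - y1)
                    + Omega0_inv @ gamma0)
    gamma = rng.multivariate_normal(th_g, Sig_g)

    return eta, tau2, zeta, z, gamma
```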

3.4. Prediction with Bias Corrected Regression Function

According to the Gaussian (i.e., MaxEnt) process prior, the joint distribution of the training outputs ( η n 1 ) and test outputs ( η n 2 ) is
$$ \begin{pmatrix} \eta_{n_1} \\ \eta_{n_2} \end{pmatrix} \Big|\, x \;\sim\; N_n\!\left( m(x) = \begin{pmatrix} m_1(x) \\ m_2(x) \end{pmatrix},\;\; K(x) = \begin{pmatrix} K_{11}(x) & K_{12}(x) \\ K_{21}(x) & K_{22}(x) \end{pmatrix} \right), $$
where $\eta_n = (\eta_{n_1}^\top, \eta_{n_2}^\top)^\top$, $K_{12}(x)$ denotes the $n_1 \times n_2$ matrix of the covariances evaluated at all pairs of training points $\{x_i \mid i = 1, \ldots, n_1\}$ and test points $\{x_j \mid j = n_1 + 1, \ldots, n\}$, and similarly for the other blocks $K_{11}(x)$, $K_{21}(x)$, $K_{22}(x)$. The RSGPR framework provides a straightforward way of predicting test outputs based on the relevant test points and the training outputs. Conditioning the joint Gaussian prior distribution on the training observations, we arrive at the predictive distribution for the future (or missing) regression function given by
$$ [\eta_{n_2} \mid \eta_{n_1}, x] \sim N_{n_2}\!\Big( m_2(x) + K_{21}(x)K_{11}(x)^{-1}\big(\eta_{n_1} - m_1(x)\big),\;\; K_{22}(x) - K_{21}(x)K_{11}(x)^{-1}K_{12}(x) \Big). \qquad (21) $$
The regression function ($\eta_{n_2}$) value can be sampled from the predictive distribution (21) by evaluating the mean and covariance matrix of the distribution. Thus, it can be generated within the preceding MCMC algorithm for estimating the RSGPR model: we can generate $\eta_{n_2}$ and the unobserved observation vector $y_{n_2} = (y_{n_1+1}, \ldots, y_n)^\top$ in the $r$-th iteration of the algorithm, whose Markov chain is augmented with the following conditional distributions.
$$ \eta_{n_2}^{(r)} \mid \eta_{n_1}^{(r)}, x \sim N_{n_2}\!\Big( m_2(x) + K_{21}(x)K_{11}(x)^{-1}\big(\eta_{n_1}^{(r)} - m_1(x)\big),\;\; K_{22}(x) - K_{21}(x)K_{11}(x)^{-1}K_{12}(x) \Big), $$
$$ y_{n_2}^{(r)} \mid \Theta^{(r)}, y_{n_1}, x \sim N\!\Big( \eta_{n_2}^{(r)},\;\; \sigma^{2,(r)} D_2(\delta(\omega))^{(r)} \Big), $$
where $\eta_{n_2}^{(r)} = (\eta(x_{n_1+1})^{(r)}, \ldots, \eta(x_n)^{(r)})^\top$ and $D_2(\delta(\omega))^{(r)} = \mathrm{diag}\{\delta(\omega_{n_1+1})^{(r)}, \ldots, \delta(\omega_n)^{(r)}\}$. Let $\eta_{n_2}^{(1)}, \ldots, \eta_{n_2}^{(R)}$ and $y_{n_2}^{(1)}, \ldots, y_{n_2}^{(R)}$ be the respective samples generated from the $R$ iterations; then the bias-corrected expected value of $\eta_{n_2}$ and that of the posterior predictive distribution of $y_{n_2}$ can be approximated via Monte Carlo by
$$ \hat{\eta}_{n_2} = E[\eta_{n_2} \mid x] \approx \frac{1}{R}\sum_{r=1}^{R} \eta_{n_2}^{(r)} \quad \text{and} \quad E[y_{n_2} \mid y_{n_1}, x] \approx \frac{1}{R}\sum_{r=1}^{R} y_{n_2}^{(r)}. $$
Note that $\mathrm{Cov}(\eta_{n_2} \mid x) = K_{22}(x) - K_{21}(x)K_{11}(x)^{-1}K_{12}(x)$.
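For completeness, a minimal Python sketch of a single draw from the conditional predictive distribution (21), given one posterior draw of $\eta_{n_1}$; argument names follow the partition above and are illustrative, not the authors' R code.

```python
import numpy as np

def draw_eta_test(eta1, m1, m2, K11, K12, K21, K22, rng):
    """Draw eta_{n2} from the conditional Gaussian predictive distribution (21),
    given a posterior draw eta1 of eta_{n1}."""
    mean = m2 + K21 @ np.linalg.solve(K11, eta1 - m1)   # m2 + K21 K11^{-1}(eta1 - m1)
    cov = K22 - K21 @ np.linalg.solve(K11, K12)         # K22 - K21 K11^{-1} K12
    return rng.multivariate_normal(mean, cov)
```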

4. Numerical Illustrations

This section presents empirical results of the Bayesian hierarchical RSGPR analysis of non-normal data with sample selection. We provide results obtained from simulated data applications comparing the performance of the RSGPR model with that of the GPR model. Our program was written in R (see, e.g., [29]) and is available from the authors upon request.

4.1. Simulation Scheme

In this simulation, we evaluated the finite-sample performance of the RSGPR model by using generated sample-selection data. The performance was assessed in terms of sample-selection bias correction and robustness to non-normal model errors. These were measured by comparing the posterior estimation and prediction results of the RSGPR model with those based on the GPR model. Specifically, we compared the results obtained from the SGPR_N (or SGPR_$t_{10}$) analysis with the results of the GPR (or GPR_$t_{10}$) analysis based on partially observed sample-selection data. This study also demonstrated that the SGPR_$t_\nu$ model is more robust against outliers than the SGPR_N model. To evaluate the performance, we generated $M = 300$ sets of partially observed sample-selection data of size $n = 300$ with $n_1 = 150$ (i.e., the missing rate is 0.5) from each of three models (see details below). The general form of the three models is as follows:
$$ y_i = \begin{cases} 50\, x_i + 5\sin(10\, x_i) + \epsilon_i & \text{for } s_i = 1,\; x_i \in (0, 1), \\ \text{missing} & \text{for } s_i = 0, \end{cases} \quad i = 1, \ldots, n, \qquad z_i = \gamma + \varepsilon_i, \quad \begin{pmatrix} \epsilon_i \\ \varepsilon_i \end{pmatrix} \overset{iid}{\sim} SMN_2(0, \Sigma, \delta, G), \qquad (22) $$
where $s_i = I(z_i \ge 0)$, $\gamma = 0$, $\rho = 0.5$, and $\sigma = 3$.
Model 1 was defined by assuming that the distribution $G$ degenerates at $\omega = 1$. Model 2 was obtained from the model (22) by setting $\delta(\omega) = 1/\omega$ and $G = G(10/2, 10/2)$. Model 3 assumed a mixture of bivariate normal errors instead of the $SMN_2(0, \Sigma, \delta, G)$ distribution. Throughout our simulation, the hyperparameters for the Bayesian hierarchical model in Theorem 1 were chosen to reflect the diffuseness of the priors. To obtain the limiting non-informative priors of $\zeta$, $\tau^2$, and $\gamma$, their hyperparameters were set to $\theta_0 = 0$, $\sigma_0 = 10$, $c = 0.001$, $d = 0.001$, $\gamma_0 = 0$, and $\Omega_0 = 10$. Note that, when the observational data are augmented with proper prior information, as in this simulation study, the issue of identifying the parameters in the RSGPR model disappears.
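To make the simulation design concrete, here is a minimal Python sketch that generates one dataset from Model 1 (the normal-error case) of Equation (22); it is not the authors' R program, and the equally spaced design points in (0, 1) are an assumption for illustration.

```python
import numpy as np

def generate_model1(n=300, rho=0.5, sigma=3.0, gamma=0.0, seed=0):
    """Generate one sample-selection dataset from Model 1 of Equation (22):
    bivariate normal errors (G degenerate at omega = 1)."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 1.0, n + 2)[1:-1]            # design points in (0, 1)
    eta = 50.0 * x + 5.0 * np.sin(10.0 * x)           # true regression function
    Sigma = np.array([[sigma ** 2, rho * sigma],
                      [rho * sigma, 1.0]])            # Cov of (epsilon_i, eps_i)
    err = rng.multivariate_normal([0.0, 0.0], Sigma, size=n)
    z = gamma + err[:, 1]                             # selection equation
    s = (z >= 0).astype(int)                          # s_i = I(z_i >= 0)
    y = np.where(s == 1, eta + err[:, 0], np.nan)     # y_i observed only if s_i = 1
    return x, y, s, eta

x, y, s, eta = generate_model1()
print(s.mean())   # selection rate, about 0.5 when gamma = 0
```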
In the simulation, we proceeded as follows to estimate the parameters. Using each of M = 300 datasets generated from the models (Model 1, Model 2, and Model 3), we fitted the RSGPR and GPR models and applied the proposed Bayesian hierarchical methodology to estimate the parameters of the fitted models by assuming the above prior distributions. To implement the methodology by using each generated dataset, we obtained 15,000 posterior samples from the developed MCMC algorithm (in Section 3) with 5 thinning periods after a burn-in period of 5000 samples. This sampling plan guaranteed a convergence of the chain of the MCMC algorithm. The MCMC method (applied to each of M = 300 datasets) gave estimates (or predictions) of the nonparametric regression function ( η ( x ) ) as well as the other parameters of the RSGPR model.
The variability in the regression function estimates ($\hat{\eta}_{n_1}$) and predictions ($\hat{\eta}_{n_2}$) obtained by using a dataset was then visualized, as shown in Figure 2 and Figure 3. These figures compare the estimates (or predictions) of the nonparametric regression function obtained from the two models (the RSGPR and GPR models). The black line in each graph shows the true regression function of the model (22). The red dashed line denotes the posterior mean (or predicted value) of the regression function of the model (22) obtained by using a Bayesian hierarchical RSGPR analysis of the sample-selection data of size $n_1 = 150$, while the blue dashed line depicts that obtained by using a GPR analysis of the sample-selection data. The 97.5th and 2.5th quantiles of the 3000 posterior samples (predictions) of each regression function value ($\eta(x_i)$) in the RSGPR model were also calculated. In each figure, these quantiles were used to draw 95% posterior (or prediction) intervals of the $\eta(x_i)$'s as the gray band. The accuracy of parameter estimates was calculated by using the mean absolute bias (MAB) and the root mean square error (RMSE):
$$ \mathrm{MAB} = \frac{1}{M}\sum_{k=1}^{M}\sum_{\ell=1}^{p} \big| \hat{\theta}_{k\ell} - \theta_\ell \big| \quad \text{and} \quad \mathrm{RMSE} = \left[ \frac{1}{M}\sum_{k=1}^{M}\sum_{\ell=1}^{p} \big( \hat{\theta}_{k\ell} - \theta_\ell \big)^2 \right]^{1/2}, $$
where $M = 300$ and $\hat{\theta}_{k\ell}$ is the posterior estimate of the $\ell$-th element of the $p \times 1$ parameter vector $\theta$ obtained in the $k$-th replication.
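A small Python sketch of these two accuracy measures, assuming the estimates are stored in an $M \times p$ array (an illustrative layout):

```python
import numpy as np

def mab_rmse(theta_hat, theta_true):
    """MAB and RMSE over M replications, as in the display above.
    theta_hat  : (M, p) array of posterior estimates, one row per replication.
    theta_true : (p,) array of true parameter values."""
    diff = theta_hat - theta_true
    mab = np.mean(np.sum(np.abs(diff), axis=1))
    rmse = np.sqrt(np.mean(np.sum(diff ** 2, axis=1)))
    return mab, rmse
```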

4.2. Performance of the RSGPR Model

4.2.1. Sample-Selection Data Generated from Model 1

If the distribution $G$ degenerates at $\omega = 1$, we obtain the SGPR_N model from the RSGPR model (6). Using each of the $M = 300$ datasets generated from Model 1, the proposed Bayesian hierarchical methodology was applied to estimate the parameters of the model. Setting $\rho = 0$, the methodology yielded posterior estimates for the GPR model. The estimation results for the parameter $\eta_{n_1}$ of our primary interest, based on the SGPR_N and GPR models, are shown in the left panel of Figure 2. The left panel provides the following results. First, the posterior estimates of $\eta_{n_1}$ based on the proposed SGPR_N model (red dashed line) are close to their true values (black line), while those based on the GPR model (blue dashed line) tend to have severe sample-selection bias. Second, when the SGPR_N model was used to fit the generated sample-selection dataset, the posterior estimates of the regression function based on the model concentrated around the true values of the $\eta(x_i)$'s, as shown by their 95% posterior intervals (gray band). Third, the difference between the true regression function (black line) and the regression function estimated by the GPR model (blue dashed line) confirms Lemma 3, which shows the existence of the sample-selection bias in a GPR analysis of data with sample selection. In summary, the left panel of Figure 2 illustrates the existence of the sample-selection bias in the GPR analysis of sample-selection data, as discussed in Section 2.3. It also demonstrates the ability of the proposed methodology based on the SGPR_N model to eliminate the sample-selection bias (or inconsistency in estimating the regression function), which could not be achieved by using the GPR model.
The means of the $M = 300$ estimation results for the other parameters are listed in Table 1. As shown in Table 1, the MCMC parameter estimates were close to their true values for the SGPR_N model, while those based on the GPR model were severely biased. In addition, the MAB and RMSE values in the table confirm that the performance of the SGPR_N model is far better than that of the GPR model when sample-selection data are used for a Bayesian nonparametric regression analysis. For both models, the small values of the Monte Carlo (MC) error (compared to the RMSE) of each parameter suggest that approximate convergence was reached and that the generated MCMC chains were well mixed.

4.2.2. Data Generated from Model 2

The proposed Bayesian hierarchical methodology for the SGPR_$t_{10}$ model was applied to each of the $M = 300$ datasets generated from Model 2. The SGPR_$t_{10}$ model can be obtained from the RSGPR model (6) by setting $\delta(\omega) = 1/\omega$ and $\omega \sim G(10/2, 10/2)$. Setting $\rho = 0$, the methodology could also be used to obtain posterior samples to estimate the GPR_$t_{10}$ model (GPR model with $t_{10}$ errors). The results of the simulation appear in the right panel of Figure 2 and in Table 1. The graphs in the right panel of Figure 2 depict the prediction results for $\eta_{n_2}$ based on the SGPR_$t_{10}$ and GPR_$t_{10}$ models. The graphs clearly show that the sample-selection bias in predicting the $\eta(x_i)$'s based on the GPR_$t_{10}$ model is too large to allow a prediction of the true regression function $\eta_{n_2}$ (or future regression function). However, the proposed methodology using the SGPR_$t_{10}$ model correctly predicted the true regression function; see the 95% prediction interval and $\hat{\eta}_{n_2}$ (the red dashed line). The prediction of $\eta_{n_2}$ based on the SGPR_$t_{10}$ model is far better than that based on the GPR_$t_{10}$ model. Compared with the GPR_$t_{10}$ model, the methodology based on the SGPR_$t_{10}$ model yields smaller MAB and RMSE values for the parameter estimates; see Table 1. Table 1 also shows that the parameter estimates of the SGPR_$t_{10}$ model with heavy-tailed errors tend to produce larger estimation errors (MAB and RMSE) than those of the SGPR_N model. The results of this simulation demonstrate the superior performance of the SGPR_$t_{10}$ model and the usefulness of the proposed Bayesian hierarchical methodology in remedying the sample-selection bias in prediction that occurred in the GPR_$t_{10}$ analysis of the sample-selection data.

4.2.3. Data Generated from Model 3 with Normal Mixture Errors

We generated datasets from Model 3 with size n = 300 . Model 3 was defined by the model (22) with independent bivariate normal mixture errors: viz.
$$ 0.4\, N_2\big(0, \Sigma^{(1)}\big) + 0.2\, N_2\big(0, \Sigma^{(2)}\big) + 0.2\, N_2\big(0, \Sigma^{(4)}\big) + 0.1\, N_2\big(0, \Sigma^{(8)}\big) + 0.1\, N_2\big(0, \Sigma^{(16)}\big), $$
where 50% of the outcomes were missing in each dataset and $\Sigma^{(k)}$ was equal to $\Sigma$ with $\sigma^2$ replaced by $9k$. Each generated dataset was fitted to the SGPR_N, SGPR_$t_5$, and SGPR_$t_{10}$ models in turn. Based on posterior samples, we calculated the Bayes estimates of the three models' parameters together with their deviance information criterion (DIC) values introduced by [30]. The average DIC values obtained from the datasets were found to be 2727.06, 1477.43, and 1396.24 for the SGPR_N, SGPR_$t_{10}$, and SGPR_$t_5$ models, respectively. This suggested that the SGPR_$t_5$ model is the best-fitting model among the three, while the SGPR_N model is the worst.
The graphs in the left panel of Figure 3 show the true regression function (black line) and the estimated regression functions ($\hat{\eta}_{n_1}$) under the best-fitting SGPR_$t_5$ model (red line) and the SGPR_N model (blue line). The graphs in the right panel of Figure 3 depict the predicted regression function ($\hat{\eta}_{n_2}$) based on the best-fitting model (in red), the SGPR_N model (in blue), and the true regression function (in black). Even though the best-fitting model, based on bivariate $t_5$ error distributions, was itself misspecified, the 95% posterior (or prediction) intervals of $\eta(x_i)$ obtained from the SGPR_$t_5$ model included the true regression function values (see the gray bands in Figure 3). The prediction results of both the SGPR_$t_5$ and SGPR_N models fluctuate considerably because of the outliers generated by the normal mixture errors, but those of the SGPR_$t_5$ model are more robust to the model misspecification.

5. Conclusions

This study considered a MaxEnt approach to develop a Bayesian nonparametric regression analysis of non-normal data with sample selection. For this purpose, by using Boltzmann's maximum entropy theorem, we introduced a MaxEnt process regression model that reflects partial prior information about an uncertain regression function. We found that a special case of the MaxEnt regression model reduces to the well-known GPR model. Second, we generalized the GPR model to propose the RSGPR model and explored its theoretical properties. These properties showed that the new model is well designed to correct the sample-selection bias and to implement a robust GPR analysis. Third, we developed a hierarchical RSGPR model based on a stochastic representation of the RSGPR model and proposed a Bayesian hierarchical methodology for the RSGPR analysis of non-normal data with sample selection. A simulation study of finite-sample performance showed that the proposed methodology eliminated the sample-selection bias and estimated the population model parameters robustly and with high accuracy. In a comparative numerical study of nonparametric regression models with sample-selection data, we found that the estimation results using the RSGPR model outperformed those using the GPR model for both in-sample estimation and out-of-sample forecasts.
The theoretical results of the RSGPR model and the methodology for the RSGPR analysis proposed in this study raise several interesting issues that are worth considering further. First, the RSGPR framework using the MaxEnt process prior can be generalized to the so-called stochastically constrained RSGPR regression, which uses the constrained MaxEnt process as the prior distribution of the regression function with uncertain constraints. Second, an empirical study with real data, as well as an asymptotic evaluation such as consistency, would be particularly worthwhile to explore. For example, estimating a monotone regression function with or without uncertainty and testing the monotonicity of the regression function can be considered in the context of a constrained RSGPR analysis with sample-selection data. Finally, the Bayesian hierarchical methodology can be broadened to various regression models with the general class of skew-SMN error distributions considered by [11]. For example, this methodology can be applied to a von Bertalanffy growth curve analysis of heavy-tailed fishery data with sample selection (see, e.g., [5]). We hope to address all of these in the future.

Acknowledgments

Research of Hea-Jung Kim was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF-2015R1D1A1A01057106).

Author Contributions

Hea-Jung Kim: Initiated project plan, led research effort, worked theoretical part, wrote software code, wrote paper. Daehwa Jin: Collected majority of data, wrote software code, analyzed data, contributed to writing of paper. Mihyang Bae: Wrote software code, analyzed data, contributed to writing of paper.

Conflicts of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Appendix A

Appendix A.1. Proof of Lemma 1

Proof. 
The proof proceeds along the lines of Corollary 1 of [18] by changing the partial prior information on θ to that on η n ( x ) .  ☐

Appendix A.2. Proof of Lemma 2

Proof. 
Equation (8) shows that the distribution of $[y_i \mid \eta(x_i), s_i = 1]$ is skew-$SMN(C_i; \theta_i, \Sigma, \delta, G)$ with the density (8). Thus, the result by [13] yields a stochastic representation of a conditional skew-$SMN(C_i; \theta_i, \Sigma, \delta, G)$ distribution given $\omega$, which is
$$ [y_i \mid \omega, \eta(x_i), s_i = 1] \overset{d}{=} \eta(x_i) + \rho\sigma Z_{C_i} + \sigma(1 - \rho^2)^{1/2} U_i, $$
where $U_i$ is independent of $Z_{C_i}$. Introducing $\omega \sim G(\omega)$ into the conditional stochastic representation, we have the two-stage distributional hierarchy for the distribution of $[y_i \mid \eta(x_i), s_i = 1]$.  ☐

Appendix A.3. Proof of Lemma 3

Proof. 
For the RSGPR model, let z 1 = ( z 1 , , z n 1 ) be the latent variables vector which corresponds to the observed vector y n 1 . Then, for fixed η n 1 and ω , the joint distribution of y n 1 and z 1 is ( y n 1 , z 1 ) N 2 n 1 ( μ * , δ ( ω ) Σ I n 1 ) , where μ * = ( η n 1 , μ 1 ) and μ 1 = ( v 1 γ , , v n 1 γ ) . This yields the density of selected observations (i.e., [ y n 1 | ( δ ( ω ) ) 1 / 2 ( z 1 μ 1 ) C α ] is given by
f ( y n 1 | η n 1 , Ψ ) = ϕ n 1 y 1 ; η n 1 , δ ( ω ) σ 2 I n 1 Φ ¯ n 1 C α ; ρ σ ( y n 1 η n 1 ) , ( 1 ρ 2 ) I n 1 / Φ ¯ n 1 C α ; 0 , I n 1 ,
where ϕ p ( · ; μ , Σ ) is the p-variate normal density with mean vector μ and covariance matrix Σ , and Φ ¯ p ( C ; μ , Σ ) = C ϕ p ( v ; μ , Σ ) d v . Applying the Gaussian process prior p 0 ( η n 1 ) exp 1 2 ( ( η n 1 m 1 ( x ) ) K 11 ( x ) 1 ( η n 1 m 1 ( x ) ) to the likelihood yields a conditional posterior density:
p ( η n 1 | y n 1 , Ψ ) = ϕ n 1 η n 1 ; θ 1 , Ω 1 Φ ¯ n 1 C α ; θ 2 + Γ Ω 1 ( η n 1 θ 1 ) , Ω 2 Γ Ω 1 1 Γ / Φ ¯ n 1 C α ; θ 2 , Ω 2 ,
which is the skew-normal distribution whose properties were well developed by [13,23]. According to this literature, we can easily obtain the stochastic representation (10). ☐

Appendix A.4. Proof of Corollary 2

Proof. 
Setting $\delta(\omega) = 1$ (i.e., the distribution of $\omega$ degenerates at $\omega = 1$ and $\delta(\omega) = \omega$), the RSGPR model reduces to the SGPR_N model. Applying Lemma 3 to the SGPR_N model and using the mean of a truncated normal distribution given in [31] yields the conditional posterior mean of the regression function: $E[\eta_{n_1} \mid y_{n_1}, \Psi] = \theta_1 + \Gamma \Omega_2^{-1} E[W_1^{C_\beta}]$. The difference between this posterior mean and the corresponding first $n_1 \times 1$ sub-vector in Equation (5) gives the result. ☐

Appendix A.5. Proof of Corollary 3

Proof. 
Under the RSGPR model E [ y i | η ( x i ) , s i = 1 ] = η ( x i ) + ρ σ E δ 1 ( v i γ , ω ) by Lemma 2 and the expression (see, e.g., [31]) of E [ Z C i | ω ] , where η ( x i ) is the regression function and α i = v i γ . A straightforward derivation of E [ y i | η ( x i ) , s i = 1 ] with respect to x k i gives
E [ y i | η ( x i ) , s i = 1 ] x k i = η ( x i ) x k i + γ k ρ σ E ω 1 δ ( ω ) δ 2 ( v i γ , ω ) δ 1 ( v i γ , ω ) 2 .
Comparing with E [ y i | η ( x i ) ] x k i = η ( x i ) x k i for the GPR model, we see that the expression (11) is the magnitude of the sample-selection bias in estimating the marginal effect of the k-th independent variable. ☐

Appendix A.6. Proof of Theorem 1

Proof. 
The first four stages of the distributional hierarchy reduce to the stochastic representation in Lemma 2 where marginal density of [ y i | s i = 1 ] is h ( y i ) which is given by Equation (8). We also see that the 2nd to 4th stages of the distributional hierarchy yield the probability mass function p ( s i ) , which is
0 p ( s i | z i , ω i ) ϕ ( z i ; v i γ , δ ( ω i ) ) d z i d G ( ω i ) = F ¯ ( C i ; 0 , 1 ) s i 1 F ¯ ( C i ; 0 , 1 ) 1 s i .
This is equivalent to Equation (7). As a result, the logarithm of the joint density of the n pairs of independent observations ( y i , s i ) under the hierarchy is equal to the right-hand side of Equation (12). Thus, the first four stages of the hierarchy define a hierarchical RSGPR model. Introducing the GP prior of η n 1 and the priors of Ψ to elicit our prior information about them, we have the Bayesian hierarchical model. ☐

Appendix A.7. Derivation of Conditional Posterior Distributions

(1)
Full conditional distribution of η n 1 : Equation (13) states that the full conditional density of η n 1 is
p η n 1 | Θ η n 1 , D n ϕ n 1 y n 1 ; η n 1 + ζ z C , τ 2 D 1 ( δ ( w ) ) ϕ n 1 ( η n 1 ; m 1 ( x ) , K 11 ( x ) ) , exp 1 2 η n 1 Σ η n 1 1 η n 1 + θ η n 1 Σ η n 1 1 η n 1 , exp 1 2 ( η n 1 θ η n 1 ) Σ η n 1 1 ( η n 1 θ η n 1 ) ,
which is the kernel of N n 1 ( θ η n 1 , Σ η n 1 ) distribution.
(2)
Full conditional distribution of τ 2 : We see from Equation (13) that the full conditional posterior density is
p ( τ 2 Θ τ 2 , D n ) i = 1 n 1 ϕ ( y i ; η ( x i ) + ζ z C i , δ ( ω i ) τ 2 ) I G ( τ 2 ; c , d ) ϕ ( ζ ; θ 0 , σ 0 τ 2 ) , τ ( n 1 + 2 c + 3 ) exp d + 1 2 i = 1 n 1 ( y i η ( x i ) ζ z C i ) 2 δ ( ω i ) + ( ζ θ 0 ) 2 2 σ 0 / τ 2 .
This is the kernel of IG c + n 1 + 1 2 , d + 1 2 i = 1 n 1 ( y i η ( x i ) ζ z C i ) 2 δ ( ω i ) + ( ζ θ 0 ) 2 2 σ 0 distribution.
(3)
Full conditional distribution of ζ : Equation (13) gives the full conditional density of ζ given by
p ( ζ Θ ζ , D n ) i = 1 n 1 ϕ ( y i ; η ( x i ) + ζ z C i , δ ( ω i ) τ 2 ) ϕ ( ζ ; θ 0 , σ 0 τ 2 ) , exp ζ 2 2 θ ζ ζ 2 σ ζ 2 exp ( ζ θ ζ ) 2 2 σ ζ 2 ,
which is the kernel of N ( θ ζ , σ ζ 2 ) distribution.
(4)
Full conditional distributions of z i ’s: Equation (13) indicates that the full conditional posterior densities of z i s are mutually independent, and that for each i,
p ( z i Θ i , D n ) ϕ y i ; η ( x i ) + ζ ( z i v i γ ) , δ ( ω i ) τ 2 ϕ ( z i ; v i γ , δ ( ω i ) ) s i ϕ ( z i ; v i γ , δ ( ω i ) ) 1 s i [ ϕ ( z i ; θ z i , σ z i 2 ) ] s i ϕ ( z i ; v i γ , δ ( ω i ) ) 1 s i .
Since the support of z i is { z i ; z i 0 } for s i = 1 , while { z i ; z i < 0 } for s i = 0 , we have the truncated normal distributions.
(5)
Full conditional distribution of γ : The full conditional posterior density of γ is given by
p ( γ Θ γ , D n ) i = 1 n 1 ϕ y i ; η ( x i ) + ζ ( z i v i γ ) , δ ( ω i ) τ 2 × ϕ q ( γ ; γ 0 , Ω 0 ) i = 1 n ϕ ( z i ; v i γ , δ ( ω i ) ) exp 1 2 ( γ θ γ ) Σ γ 1 ( γ θ γ ) ,
which is the kernel of N q ( θ γ , Σ γ ) distribution.

References

  1. Cox, G.; Kachergis, G.; Shiffrin, R. Gaussian process regression for trajectory analysis. In Proceedings of the Annual Meeting of the Cognitive Science Society, Sapporo, Japan, 1–4 August 2012; Volume 34.
  2. Rasmussen, C.E.; Nickisch, H. Gaussian processes for machine learning (GPML) toolbox. J. Mach. Learn. Res. 2010, 11, 3011–3015.
  3. Liutkus, A.; Badeau, R.; Richard, G. Gaussian processes for underdetermined source separation. IEEE Trans. Signal Process. 2011, 59, 3155–3167.
  4. Caywood, M.S.; Roberts, D.M.; Colombe, J.B.; Greenwald, H.S.; Weiland, M.Z. Gaussian Process Regression for Predictive But Interpretable Machine Learning Models: An Example of Predicting Mental Workload across Tasks. Front. Hum. Neurosci. 2017, 10, 1–19.
  5. Contreras-Reyes, J.E.; Arellano-Valle, R.B.; Canales, T.M. Comparing growth curves with asymmetric heavy-tailed errors: Application to the southern blue whiting (Micromesistius australis). Fish. Res. 2014, 159, 88–94.
  6. Heckman, J.J. Sample selection bias as a specification error. Econometrica 1979, 47, 153–161.
  7. Marchenko, Y.V.; Genton, M.G. A Heckman selection-t model. J. Am. Stat. Assoc. 2012, 107, 304–317.
  8. Ding, P. Bayesian robust inference of sample selection using selection t-models. J. Multivar. Anal. 2014, 124, 451–464.
  9. Hasselt, V.M. Bayesian inference in a sample selection model. J. Econ. 2011, 165, 221–232.
  10. Arellano-Valle, R.B.; Contreras-Reyes, J.E.; Stehlík, M. Generalized skew-normal negentropy and its application to fish condition factor time series. Entropy 2017, 19, 528.
  11. Kim, H.-J.; Kim, H.-M. Elliptical regression models for multivariate sample-selection bias correction. J. Korean Stat. Soc. 2016, 45, 422–438.
  12. Kim, H.-J. Bayesian hierarchical robust factor analysis models for partially observed sample-selection data. J. Multivar. Anal. 2018, 164, 65–82.
  13. Kim, H.-J. A class of weighted multivariate normal distributions and its properties. J. Multivar. Anal. 2008, 99, 1758–1771.
  14. Lenk, P.J. Bayesian inference for semiparametric regression using a Fourier representation. J. R. Stat. Soc. Ser. B 1999, 61, 863–879.
  15. Fahrmeir, L.; Kneib, T. Bayesian Smoothing and Regression for Longitudinal, Spatial and Event History Data; Oxford Statistical Science Series; Oxford University Press: Oxford, UK, 2011; Volume 36.
  16. Chakraborty, S.; Ghosh, M.; Mallick, B.K. Bayesian nonlinear regression for large p and small n problems. J. Multivar. Anal. 2012, 108, 28–40.
  17. Leonard, T.; Hsu, J.S.J. Bayesian Methods: An Analysis for Statisticians and Interdisciplinary Researchers; Cambridge University Press: New York, NY, USA, 1999.
  18. Kim, H.-J. A two-stage maximum entropy prior of location parameter with a stochastic multivariate interval constraint and its properties. Entropy 2016, 18, 188.
  19. Shi, J.; Choi, T. Gaussian Process Regression Analysis for Functional Data; Monographs on Statistics and Applied Probability; Chapman & Hall: London, UK, 2011.
  20. Rasmussen, C.E.; Williams, C.K.I. Gaussian Processes for Machine Learning; The MIT Press: Cambridge, MA, USA, 2006.
  21. Andrews, D.F.; Mallows, C.L. Scale mixtures of normal distributions. J. R. Stat. Soc. Ser. B 1974, 36, 99–102.
  22. Lachos, V.H.; Labra, F.V.; Bolfarine, H.; Ghosh, P. Multivariate measurement error models based on scale mixtures of the skew-normal distribution. Statistics 2010, 44, 541–556.
  23. Arellano-Valle, R.B.; Branco, M.D.; Genton, M.G. A unified view on skewed distributions arising from selection. Can. J. Stat. 2006, 34, 581–601.
  24. Kim, H.J.; Choi, T.; Lee, S. A hierarchical Bayesian regression model for the uncertain functional constraint using screened scale mixture of Gaussian distributions. Statistics 2016, 50, 350–376.
  25. Rubin, D.B. Inference and missing data. Biometrika 1976, 63, 581–592.
  26. Ntzoufras, I. Bayesian Modeling Using WinBUGS; Wiley: New York, NY, USA, 2009.
  27. Chib, S.; Greenberg, E. Understanding the Metropolis–Hastings algorithm. Am. Stat. 1995, 49, 327–335.
  28. Gelman, A. Prior distributions for variance parameters in hierarchical models (comment on article by Browne and Draper). Bayesian Anal. 2006, 1, 515–534.
  29. R Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2017; ISBN 3-900051-07-0.
  30. Spiegelhalter, D.; Best, N.; Carlin, B.; van der Linde, A. Bayesian measures of model complexity and fit (with discussion). J. R. Stat. Soc. Ser. B 2002, 64, 583–639.
  31. Johnson, N.L.; Kotz, S.; Balakrishnan, N. Distributions in Statistics: Continuous Univariate Distributions, 2nd ed.; John Wiley & Sons: New York, NY, USA, 1994; Volume 1.
Figure 1. Graphs of the sample-selection bias and the difference in marginal effect of the k-th predictor.
Figure 2. Graphs of estimated regression functions (left panel) and predicted regression functions (right panel): (i) black lines are used for the true regression function; (ii) red dashed lines for the robust sample-selection Gaussian process regression (RSGPR) models; (iii) blue dashed lines for the Gaussian process regression (GPR) models.
Figure 3. Graphs of regression functions: estimated regression functions (left panel) and predicted regression functions (right panel).
Table 1. Posterior Summary.

| Model | True Value | Mean | s.d. | RMSE | MAB | MC Error |
|---|---|---|---|---|---|---|
| SGPR_N | σ = 3 | 2.831 | 0.308 | 0.351 | 0.426 | 0.018 |
| GPR | σ = 3 | 2.094 | 0.104 | 0.912 | 0.800 | 0.002 |
| SGPR_N | ρ = 0.5 | 0.380 | 0.376 | 0.563 | 0.287 | 0.064 |
| GPR | ρ = 0.5 | NA | NA | NA | NA | NA |
| SGPR_t10 | σ = 3 | 2.880 | 0.974 | 0.509 | 0.515 | 0.050 |
| GPR_t10 | σ = 3 | 2.130 | 0.109 | 0.876 | 0.800 | 0.003 |
| SGPR_t10 | ρ = 0.5 | 0.435 | 0.275 | 0.627 | 0.422 | 0.032 |
| GPR_t10 | ρ = 0.5 | NA | NA | NA | NA | NA |

s.d.: standard deviation; SGPR_N: sample-selection Gaussian process normal error regression; RMSE: root mean square error; MAB: mean absolute bias; MC: Monte Carlo; GPR: Gaussian process regression.
