Article

Adaptive Bayesian Nonparametric Regression via Stationary Smoothness Priors

Economics Department, Purdue University, West Lafayette, IN 47907, USA
Mathematics 2025, 13(7), 1162; https://doi.org/10.3390/math13071162
Submission received: 26 February 2025 / Revised: 25 March 2025 / Accepted: 27 March 2025 / Published: 31 March 2025
(This article belongs to the Special Issue Bayesian Statistics and Causal Inference)

Abstract

A procedure for Bayesian nonparametric regression is described that automatically adjusts the degree of smoothing as the curvature of the underlying function changes. Relative to previous work adopting a similar approach that either employs a single global smoothing parameter or assumes that the smoothing process follows a random walk, the model considered here permits adaptive smoothing and imposes stationarity in the autoregressive smoothing process. An efficient Markov Chain Monte Carlo (MCMC) scheme for model estimation is fully described for this stationary case, and the performance of the method is illustrated in several generated data experiments. An application is also provided, analyzing the relationship between behavioral problems in students and academic achievement. Point estimates from the nonparametric methods suggest (a) expected achievement declines monotonically with a behavioral problems index (BPI) score and (b) the rate of decline is relatively flat at the left tail of the BPI distribution and then becomes sharply more negative.
MSC:
62G08

1. Introduction

The literature on nonparametric Bayesian estimation of regression relationships is substantial, as a variety of approaches have been advanced to successfully recover shapes and properties of unknown functions in the presence of observational data. One approach to Bayesian nonparametric regression, for example, employs basis functions or splines to flexibly model regression relationships (e.g., Smith and Kohn [1], Kohn et al. [2], and Chib et al. [3]). These approaches typically require the researcher to choose a functional form for the basis and the potential locations of a series of knots, and as estimation proceeds, the data are used to determine if terms in the series representation are retained or discarded. Other studies employ Gaussian processes (GPs) for this purpose (e.g., Rasmussen [4] and Dixon et al. [5] (in particular, Chapter 3 of Dixon et al. [5] provides considerable treatment of the GP approach to Bayesian nonparametric regression)). GP models, which require the researcher to choose a mean function and covariance kernel, place a prior distribution over the function space, and plausible functional realizations from the associated stochastic process are identified given the data that are observed. Still another set of studies estimate relationships nonparametrically by treating each point on the regression surface as a parameter to estimate, and then by representing the nonparametric relationship in traditional regression form. Carefully constructed priors are employed in this setting that impose similarity among neighboring points on the regression surface, thereby smoothing out what might otherwise be an erratic collection of regression coefficients. This paper follows in the path of this “smoothness prior” approach and extends it by allowing for local adaptability in the degree of smoothing, by imposing stationarity in the smoothing parameter process, and by allowing the data to update some features of how the smoothing parameters evolve.
To generally describe the approach taken, the case where our covariate, say x, is discrete-valued serves as a useful explanatory starting point. When seeking to estimate a regression function f ( x ) with such data, one would likely begin by creating a series of dummy/indicator variables which exhaust all possible x outcomes. The resulting collection of estimated dummy variable coefficients when proceeding this way, however, may produce a rather erratic estimate of the regression function, particularly when there are few observations within some of the discrete cells and/or the error variance is large. To help remedy this problem, one could adopt an informative prior expressing that adjacent dummy variable coefficients should be similar in value, essentially borrowing informational strength across the discrete cells. Expressions of such local “similarity” can be achieved, for example, by constructing a prior that centers differences of the dummy variable coefficients over zero. The extent of shrinkage/smoothing that takes place is then determined by how tightly those differences are centered over zero.
Studies such as Koop and Poirier [6], Koop and Tobias [7], Chib and Jeliazkov [8], and Chan and Tobias [9], for example, employ this smoothness prior approach. Their specifications contain a single variance parameter in the construction of the smoothness prior, which is akin to a constant global bandwidth parameter in frequentist kernel-based nonparametric regression. While these methods generally perform quite well, they are nonetheless somewhat limited by the unique smoothing parameter: the same degree of smoothing is employed over regions of the support where the function is flat or linear, and over regions where the function may be rapidly changing. Tobias and Bond [10] extend this single-parameter model, allow for adaptability in the degree of smoothing, and structure their prior so that the smoothing parameters follow a random walk. Although the adaptability afforded by such an algorithm may offer real benefit in empirical work, the assumed random walk process is nonetheless somewhat limiting as it imposes, for example, that prior variances grow as one moves across the support of x.
In this paper, we describe an alternative to the random walk model which may offer improved performance in practice. We allow the data to partially inform the way smoothness evolves by assuming the smoothing parameters follow an AR(1) process rather than a random walk. A truncated prior for the autoregressive coefficient is employed to impose stationarity of the smoothing parameter process. This elaborated structure introduces several new parameters to the model (e.g., a stationary mean and an autoregressive coefficient), and an efficient Markov Chain Monte Carlo (MCMC) scheme is derived and employed for estimating this generalized model. We show in several generated experiments that the corresponding AR(1) smoothing process can adapt more quickly than the random walk to functional curvature changes when they occur. Moreover, our experience suggests that the AR(1) structure tends to be computationally more stable and perhaps less sensitive to changes in prior hyperparameters. Offsetting this, the random walk model contains fewer parameters than the AR(1), as it imposes that the autoregressive coefficient equals unity, which both simplifies the resulting estimation algorithm and could potentially lead to increased precision. In some of the conducted experiments, support for the random walk specification is revealed, and in those cases the mass of the posterior distribution of the autoregressive coefficient is seen to concentrate near the unit root. In other cases, we find evidence against the random walk, and in those cases, the posterior distribution of the autoregressive coefficient concentrates in the interior of the support and away from the unit root.
The outline of this paper is as follows. The next section outlines our model and describes the prior specifications used to conduct nonparametric estimation and smoothing. Section 3 fully describes an MCMC scheme for model estimation, while generated data experiments are conducted in Section 4. Section 5 provides an application investigating the impact of behavioral problems in adolescents on achievement scores in mathematics and two measures of reading acumen. In this application, we find some evidence of nonlinearities, which are then further supported by simpler parametric models whose specifications are guided by our initial exploratory nonparametric findings. Specifically, additional behavioral problems are found to have a relatively small impact on achievement when the extent of behavioral issues is currently small, while increasing the number of behavioral problems has clearly negative impacts on achievement when the number of behavioral issues is already moderate or large. Finally, this paper concludes with a summary in Section 6.

2. The Model

Before describing our model, we establish a few notational conventions. Univariate random variables (e.g., $x$) are denoted with lowercase letters or symbols, while vector-valued quantities (e.g., $\mathbf{x}$) are presented in bold lowercase font. Matrices (e.g., $\mathbf{X}$) are denoted with bold capital letters or Greek symbols, as appropriate.
In this paper, we will introduce a new framework for estimating the conditional mean function in a univariate nonparametric regression model. Before describing our approach, we note that many extensions of this base model are easily handled. For example, the posterior simulator can be easily adapted to include additional controls in a partially linear framework: y i = f ( x i ) + z i β + u i . Non-normality of the disturbance vector can also be accommodated through scale or finite mixtures (see, e.g., Chan et al. [11], chapter 15, and associated references). The strategy for nonparametric estimation described below can also be included as a component of a larger treatment response model when seeking to identify the causal effect of a covariate or set of covariates on an outcome of interest (e.g., Kline and Tobias [12]). We fix ideas here, however, on the basic univariate nonparametric regression model in order to focus attention on the contributions of our estimation method:
$$y_i = f(x_i) + \epsilon_i, \quad i = 1, 2, \ldots, n, \qquad \boldsymbol{\epsilon} \mid \mathbf{X} \sim N(\mathbf{0}, \sigma^2 \mathbf{I}_n),$$
where $N(\mu, \Sigma)$ denotes a normal/Gaussian distribution with mean $\mu$ and covariance matrix $\Sigma$. The covariate $x$ takes on $k_x \le n$ distinct values. We let $x_1^*, x_2^*, \ldots, x_{k_x}^*$ denote the realized distinct ordered values of the random variable $x$, so that $x_j^* > x_{j-1}^*$ for $j = 2, 3, \ldots, k_x$. Furthermore, we let $\Delta_j$ denote the distance between consecutive $x^*$ values, i.e., $\Delta_j \equiv x_j^* - x_{j-1}^*$.
Our approach treats each $f(x_j^*)$ as a parameter to estimate. Smoothness will be achieved by specifying a prior expressing local similarity among neighboring $f(x^*)$ values, and the function is recovered by piecing together the collection of smoothed coefficients. We begin this process by constructing a $(1 \times k_x)$ label vector $\mathbf{d}_i$, for each observation $i$, as follows:
$$\mathbf{d}_i = \begin{bmatrix} \mathbf{1}(x_i = x_1^*) & \mathbf{1}(x_i = x_2^*) & \cdots & \mathbf{1}(x_i = x_{k_x}^*) \end{bmatrix},$$
where $\mathbf{1}(\cdot)$ denotes a standard indicator function. The vector $\mathbf{d}_i$ contains zeros everywhere except for a one in the position that corresponds to the observed value of $x_i$. We let $\mathbf{D}$ denote the stacked $(n \times k_x)$ label matrix for the full sample and $\boldsymbol{\pi}$ denote the associated $(k_x \times 1)$ vector of (sorted) dummy variable coefficients:
$$\mathbf{D} = \begin{bmatrix} \mathbf{d}_1 \\ \mathbf{d}_2 \\ \vdots \\ \mathbf{d}_n \end{bmatrix}, \qquad \boldsymbol{\pi} = \begin{bmatrix} \pi_1 \\ \pi_2 \\ \vdots \\ \pi_{k_x} \end{bmatrix}.$$
With this construction, we can write our nonparametric regression problem in standard and familiar regression form as
$$\mathbf{y} = \mathbf{D}\boldsymbol{\pi} + \boldsymbol{\epsilon}.$$
In the following section, we describe how this method is operationalized through the construction of smoothness priors on π . The priors employed will also be adaptive, automatically adjusting the degree of smoothing as the curvature of the function f changes.
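To make the construction concrete, the following sketch (Python/NumPy, with illustrative variable names not taken from the paper) builds the label matrix $\mathbf{D}$, the distinct ordered values $x^*$, and the spacings $\Delta_j$ from a discrete covariate vector:

```python
import numpy as np

def build_label_matrix(x):
    """Build the (n x k_x) label matrix D from a discrete covariate vector x.

    Each row of D holds a single one in the column matching the observed value
    of x_i among the sorted distinct values x_1*, ..., x_kx*. Illustrative sketch.
    """
    x = np.asarray(x)
    x_star = np.unique(x)                          # sorted distinct values x_j*
    D = (x[:, None] == x_star[None, :]).astype(float)
    delta = np.diff(x_star)                        # spacings Delta_j = x_j* - x_{j-1}*
    return D, x_star, delta

# Example: the regression y = D @ pi + eps is then in standard form
x = np.round(np.random.uniform(-2.0, 1.5, size=200), 2)
D, x_star, delta = build_label_matrix(x)
print(D.shape, x_star.shape, delta.shape)
```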

Priors and Regression Function Smoothing

We smooth the collection of $\boldsymbol{\pi}$ coefficients via a prior which imposes that neighboring pointwise slopes of the regression function should be similar in value. Such priors have been employed previously by, for example, Koop and Poirier [6], Koop and Tobias [7], Chib and Jeliazkov [8], Chib et al. [13], and Tobias and Bond [10], among others. The prior employed here will be different, however, as it does not employ a single global smoothing parameter. Instead, it associates every cell with its own smoothing parameter to permit local adaptability, while imposing stationarity in the smoothing parameter distribution.
We begin by defining the $(k_x \times k_x)$ second-order differencing matrix:
$$\mathbf{H}_\pi = \begin{bmatrix} 1 & 0 & 0 & 0 & \cdots & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & \cdots & 0 & 0 & 0 \\ \Delta_2^{-1} & -(\Delta_3^{-1} + \Delta_2^{-1}) & \Delta_3^{-1} & 0 & \cdots & 0 & 0 & 0 \\ 0 & \Delta_3^{-1} & -(\Delta_4^{-1} + \Delta_3^{-1}) & \Delta_4^{-1} & \cdots & 0 & 0 & 0 \\ \vdots & & & & \ddots & & & \vdots \\ 0 & 0 & 0 & 0 & \cdots & \Delta_{k_x-1}^{-1} & -(\Delta_{k_x}^{-1} + \Delta_{k_x-1}^{-1}) & \Delta_{k_x}^{-1} \end{bmatrix}.$$
The matrix $\mathbf{H}_\pi$ is constructed so that the quantity $\mathbf{H}_\pi \boldsymbol{\pi}$ will select off the first two elements of the dummy variable coefficient vector ($\pi_1$ and $\pi_2$) and then select the first differences of adjacent pointwise slopes of our regression function. These slope first differences take the form $\Delta_j^{-1}(\pi_j - \pi_{j-1}) - \Delta_{j-1}^{-1}(\pi_{j-1} - \pi_{j-2})$, $j = 3, 4, \ldots, k_x$.
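As an illustration, a minimal sketch of how $\mathbf{H}_\pi$ might be assembled, assuming the spacings are stored in a zero-indexed array `delta` with `delta[j-2]` holding $\Delta_j$ (illustrative code, not the paper's implementation):

```python
import numpy as np

def build_H_pi(delta):
    """Construct the (k_x x k_x) second-order differencing matrix H_pi.

    delta[j-2] holds Delta_j = x_j* - x_{j-1}* for j = 2, ..., k_x.
    For 1-based rows j >= 3, (H_pi @ pi)[j] equals
        (pi_j - pi_{j-1})/Delta_j - (pi_{j-1} - pi_{j-2})/Delta_{j-1}.
    Illustrative sketch only.
    """
    k_x = len(delta) + 1
    H = np.zeros((k_x, k_x))
    H[0, 0] = 1.0                       # first row picks off pi_1
    H[1, 1] = 1.0                       # second row picks off pi_2
    for j in range(2, k_x):             # 0-based row j corresponds to 1-based row j+1
        d_prev = delta[j - 2]           # Delta_{j} in 1-based notation (previous spacing)
        d_curr = delta[j - 1]           # Delta_{j+1} (current spacing)
        H[j, j - 2] = 1.0 / d_prev
        H[j, j - 1] = -(1.0 / d_curr + 1.0 / d_prev)
        H[j, j] = 1.0 / d_curr
    return H
```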
We achieve smoothing by imposing a belief that adjacent slopes should be similar in value. We accomplish this through the prior
$$\mathbf{H}_\pi \boldsymbol{\pi} \mid \boldsymbol{\pi}_0, \boldsymbol{\eta} \sim N\!\left(\mathbf{X}_\pi \boldsymbol{\pi}_0,\ \mathbf{V}_\eta\right), \qquad (1)$$
where
$$\mathbf{X}_\pi = \begin{bmatrix} \mathbf{I}_2 \\ \mathbf{0}_{(k_x - 2) \times 2} \end{bmatrix}, \qquad \boldsymbol{\pi}_0 = \begin{bmatrix} \pi_{01} \\ \pi_{02} \end{bmatrix}, \qquad \mathbf{V}_\eta = \begin{bmatrix} \exp(\eta_1) & 0 & \cdots & 0 \\ 0 & \exp(\eta_2) & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \exp(\eta_{k_x}) \end{bmatrix}.$$
The $\eta$ parameters are log variances/log volatilities, which govern the degree of smoothing of the regression function and summarize how that smoothing might change across the support of $x$. To explain the role and interpretation of the $\eta$ parameters in greater detail, first note that the prior in (1) centers first differences of slopes of our regression function over zero. As $\exp(\eta_j) \to 0$, $j > 2$, the prior imposes that adjacent slope differences are essentially equal to zero, thus constraining the regression function to be linear. More generally, however, the $\exp(\eta_j)$ can (and typically will) depart from zero. In such a case, the prior expresses preference for local similarity in neighboring function values (thereby achieving “smooth” results) yet does not strictly impose linearity. In regions where the true function behaves according to a linear or near-linear relationship, we would hope to associate those $f(x^*)$ with small $\eta$ values. In contrast, when the function is changing rapidly, a larger $\eta$ would base results on relatively small local neighborhoods, thus allowing for improved detection of local changes in functional curvature. The model as presented offers the potential for adaptability and the automatic determination of the appropriate neighborhood size/degree of smoothing, as $\eta$ can change with the local properties of the underlying function. We provide some examples of the algorithm’s success in adapting in this way in the experiments of Section 4. It is worth noting that many studies pursuing nonparametric estimation through priors like these employ a constant smoothing parameter, specifying $\mathbf{V}_\eta$ to be diagonal with a common variance across all diagonal elements. Our model nests this as a special case and associates each distinct $x_j^*$ with its own smoothing parameter. To operationalize the approach, some additional structure is required which describes how $\exp(\boldsymbol{\eta})$ evolves. We will return to complete this description shortly.
An additional hierarchical prior is placed over $\boldsymbol{\pi}_0$ of the form
$$\boldsymbol{\pi}_0 \sim N\!\left(\underline{\boldsymbol{\mu}}_{\pi_0},\ \underline{\mathbf{V}}_{\pi_0}\right), \qquad (3)$$
where the underline notation is used to denote terminal hyperparameters selected by the researcher. In specifying our model this way, the prior covariance matrix $\underline{\mathbf{V}}_{\pi_0}$ can be selected with large diagonal elements to let the data speak regarding the first two points on our regression curve. Specifically, if small values of $\eta$ are generally desired in order to smooth the curve, while the elements of $\eta$ are also correlated in their evolution, the resulting prior for the initial conditions may be undesirably concentrated around the prior mean $\underline{\boldsymbol{\mu}}_{\pi_0}$. The hierarchical specification in (1) and (3) combined with diffuse choices for $\underline{\mathbf{V}}_{\pi_0}$ can help mitigate this problem and allow the data to better inform $\pi_1$ and $\pi_2$.
As discussed earlier in this section, the variances { exp ( η j ) } control the degree of smoothing of our regression function. It seems reasonable to assume that the η j parameters, like those of the regression function itself, should be locally similar. While Tobias and Bond [10] express this similarity by assuming η smoothly evolves as a random walk, this (potentially undesirably) implies that the prior variance of η must increase as one moves into the right tail of the x distribution (Tobias and Bond [10] mention that a stationary version of the process produced similar results, but did not present an algorithm for such estimation. We formally describe such an algorithm in Section 3). When adopting such a specification, the imposed unit root may also imply that η adapts to curvature changes in f ( x ) rather slowly, while alternate specifications may permit more rapid adaptability. We provide examples that both favor and oppose the random walk structure in the generated data experiments of the following section.
In what follows, we assume that the elements of η follow an autoregressive or AR(1) process:
$$\begin{aligned} \eta_2 &= \mu + \rho(\eta_1 - \mu) + u_2 \\ \eta_3 &= \mu + \rho(\eta_2 - \mu) + u_3 \\ &\;\;\vdots \\ \eta_{k_x} &= \mu + \rho(\eta_{k_x - 1} - \mu) + u_{k_x}, \end{aligned}$$
where $u_j \overset{iid}{\sim} N(0, \tau^2)$, $j = 2, 3, \ldots, k_x$. The parameter $\rho$ measures the degree of persistence in the series, and the process is stationary provided $|\rho| < 1$. We will employ priors that guarantee stationarity by truncating the support of $\rho$ accordingly, although results from the nonstationary random walk model may be approximately reproduced when the posterior for $\rho$ concentrates on values near, but strictly less than, unity. The smoothing process is initialized by assuming that $\eta_1$ is drawn from the stationary distribution of the process, i.e., $\eta_1 \sim N(\mu, \tau^2/(1 - \rho^2))$. We refer to this model in what follows as the “stationary” case or the “estimated-$\rho$” case. This is in contrast to the random walk model, which imposes that $\rho = 1$ (and thus eliminates $\mu$) and adds a prior for the initial condition of the form $\eta_1 \sim N(\underline{\mu}, \underline{V})$. Other higher-order autoregressive processes for the smoothing parameters could also be entertained, although we restrict attention to the baseline AR(1) model here (in the experiments and application of the following sections, we show that, while learning takes place regarding the persistence parameter $\rho$, it is not strongly identified by the data. This suggests that further generalizations of the smoothing parameter process may prove quite difficult to implement in practice).
To further describe the implied joint prior for $\boldsymbol{\eta}$, we first define a few objects. We define the $(k_x \times k_x)$ matrix $\mathbf{G}_\rho$, the $(k_x \times 1)$ vector $\mathbf{g}_\rho$, and the diagonal $(k_x \times k_x)$ covariance matrix $\boldsymbol{\Omega} = \boldsymbol{\Omega}(\tau^2, \rho)$ as follows:
$$\mathbf{G}_\rho = \begin{bmatrix} 1 & 0 & 0 & \cdots & 0 & 0 \\ -\rho & 1 & 0 & \cdots & 0 & 0 \\ 0 & -\rho & 1 & \cdots & 0 & 0 \\ \vdots & & \ddots & \ddots & & \vdots \\ 0 & 0 & 0 & \cdots & -\rho & 1 \end{bmatrix}, \qquad \mathbf{g}_\rho = \begin{bmatrix} (1 - \rho^2) \\ (1 - \rho) \\ \vdots \\ (1 - \rho) \end{bmatrix}, \qquad \boldsymbol{\Omega} = \begin{bmatrix} \tau^2/(1 - \rho^2) & 0 & \cdots & 0 \\ 0 & \tau^2 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \tau^2 \end{bmatrix}.$$
With these definitions in hand, we can express the joint prior for $\boldsymbol{\eta}$ as
$$\mathbf{G}_\rho(\boldsymbol{\eta} - \boldsymbol{\iota}_{k_x}\mu) \mid \rho, \tau^2 \sim N(\mathbf{0}, \boldsymbol{\Omega}), \qquad (4)$$
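As a small numerical illustration of this joint prior (hypothetical code, not from the paper), one can build $\mathbf{G}_\rho$ and $\boldsymbol{\Omega}$ for chosen $(\rho, \tau^2, k_x)$ and recover the implied prior covariance of $\boldsymbol{\eta}$ as $\mathbf{G}_\rho^{-1}\boldsymbol{\Omega}(\mathbf{G}_\rho^{-1})'$; under the stationary initialization, every diagonal element of that covariance equals $\tau^2/(1 - \rho^2)$:

```python
import numpy as np

def eta_prior_covariance(rho, tau2, k_x):
    """Return G_rho, Omega, and the implied prior covariance of eta (sketch).

    G_rho has ones on the diagonal and -rho on the first subdiagonal; Omega is
    diagonal with tau^2/(1 - rho^2) in the (1,1) position and tau^2 elsewhere,
    so that G_rho (eta - iota*mu) ~ N(0, Omega).
    """
    G = np.eye(k_x) - rho * np.eye(k_x, k=-1)
    omega_diag = np.full(k_x, tau2)
    omega_diag[0] = tau2 / (1.0 - rho**2)
    Omega = np.diag(omega_diag)
    G_inv = np.linalg.inv(G)
    cov_eta = G_inv @ Omega @ G_inv.T          # prior covariance of eta
    return G, Omega, cov_eta

G, Omega, cov_eta = eta_prior_covariance(rho=0.8, tau2=0.5, k_x=6)
# Stationarity check: each marginal prior variance equals tau^2 / (1 - rho^2)
print(np.diag(cov_eta))
```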
where $\boldsymbol{\iota}_z$ denotes a $z \times 1$ vector of ones. The model specification is complete upon choosing priors for $\mu$, $\rho$, $\tau^2$, and $\sigma^2$. In what follows, we make the following choices:
$$\mu \sim N(\underline{\mu}, \underline{V}_\mu), \qquad \tau^2 \sim IG(\underline{a}_\tau, \underline{b}_\tau), \qquad \sigma^2 \sim IG(\underline{a}_\sigma, \underline{b}_\sigma),$$
with $IG(\cdot, \cdot)$ denoting an inverse gamma distribution (see, e.g., Chan et al. [11], page 442). In addition, we consider two different prior specifications for $\rho$. The first is a discrete prior with $J$ points of support (denoted $\{\underline{\rho}_j\}_{j=1}^{J}$, $|\underline{\rho}_j| < 1$ for all $j$):
$$\Pr(\rho = \underline{\rho}_j) = \underline{p}_j, \quad \rho \in \{\underline{\rho}_1, \underline{\rho}_2, \ldots, \underline{\rho}_J\}, \quad \sum_{j=1}^{J} \underline{p}_j = 1, \quad \underline{p}_j \in (0, 1)\ \forall j. \qquad (5)$$
In addition, we also consider a continuous prior of the form
$$\rho \sim TN_{(-1,1)}(\underline{\mu}_\rho, \underline{V}_\rho), \qquad (6)$$
where $TN_{(a,b)}(\mu, \sigma^2)$ denotes a normal distribution with mean $\mu$ and variance $\sigma^2$ that is truncated to the interval $(a, b)$ (see, e.g., Chan et al. [11], pages 132–134). We observe that the truncation in the prior for $\rho$ imposes that the smoothing process is stationary.

3. Posterior Simulation

In this section, we seek to derive a Markov Chain Monte Carlo (MCMC) scheme to estimate our nonparametric regression model. To this end, we begin by noting that the joint posterior distribution of our model parameters is given up to proportionality as
$$p(\boldsymbol{\pi}, \boldsymbol{\pi}_0, \boldsymbol{\eta}, \rho, \sigma^2, \tau^2, \mu \mid \mathbf{y}) \propto \phi\!\left(\mathbf{y};\, \mathbf{D}\boldsymbol{\pi},\, \sigma^2\mathbf{I}_n\right) p(\rho)\, p(\sigma^2)\, p(\tau^2)\, p(\mu)\, p(\boldsymbol{\pi}_0)\, p(\boldsymbol{\eta} \mid \mu, \rho, \tau^2)\, p(\boldsymbol{\pi} \mid \boldsymbol{\pi}_0, \boldsymbol{\eta}), \qquad (7)$$
with $p(\cdot)$ serving as generic notation for a density or mass function. In (7), $\phi(\mathbf{x}; \boldsymbol{\mu}, \boldsymbol{\Sigma})$ denotes a multivariate normal density function for the random vector $\mathbf{x}$ with mean vector $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$. The specific forms of the priors appearing in (7) were given in the previous section, and $p(\rho)$ may represent either the continuous or discrete variant of the prior for the autoregressive coefficient $\rho$.
Below we fully describe all steps of our Markov Chain Monte Carlo sampling scheme, which involves iterative simulation from the conditional posterior distributions of the model parameters. With just one exception (the sampling of ρ in the continuous case), each step only requires simulation from standard distributions. When required, a well-mixing Metropolis–Hastings (MH) algorithm is suggested for the sampling of ρ , and thus the method should be generally accessible by practitioners.
The approach that follows can be interpreted as an approximate (though highly accurate) MCMC procedure. The approximate nature of our results owes to a Gaussian mixture approximation that will be used to facilitate drawing the smoothing parameters/log volatilities η . Specifically, we use a 10-component Gaussian mixture approximation of the log of a χ 2 ( 1 ) random variable, as in Omori et al. [14], to aid in the sampling of η . This will require the introduction of a vector of component indicators, denoted s , and the sampling of these indicators will conclude one sweep of the sampler. As discussed in Del Negro and Primiceri [15] in the context of stochastic volatility models, the ordering of the sampling steps is important in this context.
To describe the ordering issue formally, we will conceptually group a subset of parameters together as follows: $\gamma \equiv \{\boldsymbol{\pi}, \boldsymbol{\pi}_0, \sigma^2, \mu, \rho, \tau^2\}$. We will also make use of the notation $\gamma_{-x}$ to denote all elements of $\gamma$ other than $x$. Each element of $\gamma$ will be sampled from its exact conditional posterior distribution, as derived from (7). Our sampling scheme will draw, in order, from $[\boldsymbol{\eta} \mid \mathbf{s}, \gamma, \mathbf{y}]$, $[\gamma \mid \boldsymbol{\eta}, \mathbf{y}]$, and finally $[\mathbf{s} \mid \boldsymbol{\eta}, \gamma, \mathbf{y}]$. Sampling from the last two of these amounts to drawing from the joint conditional posterior distribution $[\mathbf{s}, \gamma \mid \boldsymbol{\eta}, \mathbf{y}]$, given that elements of $\gamma$ will be drawn from their exact conditional posteriors, which do not involve any approximations (and thus do not involve $\mathbf{s}$). Importantly, the sampling of $\gamma$ in this scheme needs to take place immediately before $\mathbf{s}$. Before describing each of the required eight steps, we note the structural similarity of the nonparametric regression problem considered here with traditional stochastic volatility models. The reader is invited to see, for example, Chan [16] for related discussions. As described in Kim, Shephard, and Chib [17], one can make adjustments for the mixture approximation error via, for example, importance sampling reweighting of the produced simulations. This is seldom performed in practice, however, as the approximation is highly accurate and such adjustments tend to produce no meaningful differences in results.
  • Step 1: η | s , γ , y :
In equation form, the prior in (1) can be expressed as
$$\boldsymbol{\psi} = \mathbf{X}_\pi \boldsymbol{\pi}_0 + \boldsymbol{\nu},$$
where $\boldsymbol{\psi} \equiv \mathbf{H}_\pi \boldsymbol{\pi}$ and $\boldsymbol{\nu}$ is multivariate Gaussian with mean zero and diagonal covariance matrix $\mathbf{V}_\eta$ containing $\{\exp(\eta_i)\}_{i=1}^{k_x}$ stacked across the diagonal. Rearranging this equation, squaring each element, and taking the natural logarithm, we can express this prior as
$$\log\!\left[(\boldsymbol{\psi} - \mathbf{X}_\pi \boldsymbol{\pi}_0) \odot (\boldsymbol{\psi} - \mathbf{X}_\pi \boldsymbol{\pi}_0)\right] = \boldsymbol{\eta} + \boldsymbol{\zeta},$$
where each element of the $(k_x \times 1)$ vector $\boldsymbol{\zeta}$ is an $iid$ draw from a $\log \chi^2(1)$ distribution and $\odot$ denotes the Hadamard or element-wise product. This representation is beneficial, as $\boldsymbol{\eta}$ enters the right-hand side as essentially a vector of intercept parameters, thus opening the door to standard sampling possibilities. However, $\boldsymbol{\zeta}$ is not Gaussian, which introduces an unfortunate computational complication.
To address this complication, we replace the $\log \chi^2(1)$ with a highly accurate Gaussian mixture approximation. Specifically, we use a ten-component mixture approximation, as described in Omori et al. [14]:
$$\zeta_i \sim \sum_{s=1}^{10} p_s\, N(m_s, \omega_s^2).$$
The component probabilities $p_s$, means $m_s$, and variances $\omega_s^2$ are fixed (rather than estimated) and appear in Table 1 of Omori et al. [14]. A mixture approximation-based analysis thus proceeds by introducing a set of component indicators $\mathbf{s} = \{s_i\}_{i=1}^{k_x}$ such that
$$\tilde{\zeta}_i \mid s_i \sim N(m_{s_i}, \omega_{s_i}^2), \qquad \Pr(s_i = j) = p_j, \quad j = 1, 2, \ldots, 10.$$
Conditioned on the component indicators s , we have
$$\log\!\left([\boldsymbol{\psi} - \mathbf{X}_\pi \boldsymbol{\pi}_0] \odot [\boldsymbol{\psi} - \mathbf{X}_\pi \boldsymbol{\pi}_0]\right) = \mathbf{m}_s + \boldsymbol{\eta} + \tilde{\boldsymbol{\zeta}}, \qquad \tilde{\boldsymbol{\zeta}} \mid \mathbf{s} \sim N(\mathbf{0}, \boldsymbol{\Gamma}_s), \qquad (9)$$
where $\boldsymbol{\Gamma}_s \equiv \mathrm{diag}\{\omega_{s_i}^2\}$.
The prior for η in (4) can be expressed as
$$\boldsymbol{\eta} \sim N\!\left(\boldsymbol{\iota}_{k_x}\mu,\ \mathbf{G}_\rho^{-1}\boldsymbol{\Omega}\,(\mathbf{G}_\rho^{-1})'\right). \qquad (10)$$
Combining (9) and (10), one can show
$$\boldsymbol{\eta} \mid \gamma, \mathbf{s}, \mathbf{y} \sim N\!\left(\mathbf{D}_\eta \mathbf{d}_\eta,\ \mathbf{D}_\eta\right),$$
where
$$\mathbf{D}_\eta = \left(\boldsymbol{\Gamma}_s^{-1} + \mathbf{G}_\rho'\boldsymbol{\Omega}^{-1}\mathbf{G}_\rho\right)^{-1}, \qquad \mathbf{d}_\eta = \boldsymbol{\Gamma}_s^{-1}\left\{\log\!\left([\boldsymbol{\psi} - \mathbf{X}_\pi \boldsymbol{\pi}_0] \odot [\boldsymbol{\psi} - \mathbf{X}_\pi \boldsymbol{\pi}_0]\right) - \mathbf{m}_s\right\} + \mathbf{G}_\rho'\boldsymbol{\Omega}^{-1}\mathbf{G}_\rho\,\boldsymbol{\iota}_{k_x}\mu.$$
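A sketch of this draw in Python/NumPy, assuming the conditioning quantities are already in hand (the function and argument names are illustrative, and a small offset is added inside the logarithm purely for numerical stability, a common device not discussed in the text):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def draw_eta(psi, X_pi, pi0, m_s, Gamma_s_diag, G_rho, Omega_diag, mu):
    """Draw eta from N(D_eta d_eta, D_eta), as in Step 1 (illustrative sketch).

    Gamma_s_diag : mixture variances omega_{s_i}^2, one per element of eta
    Omega_diag   : diagonal of Omega, i.e. tau^2/(1-rho^2), tau^2, ..., tau^2
    """
    k_x = len(psi)
    resid = psi - X_pi @ pi0
    z = np.log(resid**2 + 1e-10)                  # small offset for stability (assumption)
    prior_prec = G_rho.T @ np.diag(1.0 / Omega_diag) @ G_rho
    post_prec = np.diag(1.0 / Gamma_s_diag) + prior_prec        # D_eta^{-1}
    d_eta = (z - m_s) / Gamma_s_diag + prior_prec @ (mu * np.ones(k_x))
    mean = cho_solve(cho_factor(post_prec), d_eta)              # D_eta d_eta
    L = np.linalg.cholesky(post_prec)
    return mean + np.linalg.solve(L.T, np.random.standard_normal(k_x))
```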
  • Step 2: $\boldsymbol{\pi} \mid \gamma_{-\pi}, \boldsymbol{\eta}, \mathbf{y}$
When conducting this step, we choose to sample $\boldsymbol{\psi} = \mathbf{H}_\pi \boldsymbol{\pi}$. Once $\boldsymbol{\psi}$ is sampled, the desired smoothed regression coefficients can be calculated at each iteration by simply inverting the mapping, i.e., calculating $\boldsymbol{\pi} = \mathbf{H}_\pi^{-1}\boldsymbol{\psi}$.
Our regression equation can be reparameterized in terms of ψ by noting
$$\mathbf{y} = \mathbf{D}\boldsymbol{\pi} + \boldsymbol{\epsilon} = \mathbf{D}\mathbf{H}_\pi^{-1}\mathbf{H}_\pi\boldsymbol{\pi} + \boldsymbol{\epsilon} = \tilde{\mathbf{D}}\boldsymbol{\psi} + \boldsymbol{\epsilon},$$
where $\tilde{\mathbf{D}} \equiv \mathbf{D}\mathbf{H}_\pi^{-1}$.
It follows that the conditional posterior distribution of ψ is Gaussian:
$$\boldsymbol{\psi} \mid \gamma_{-\psi}, \boldsymbol{\eta}, \mathbf{y} \sim N\!\left(\mathbf{D}_\psi \mathbf{d}_\psi,\ \mathbf{D}_\psi\right),$$
where
$$\mathbf{D}_\psi \equiv \left(\tilde{\mathbf{D}}'\tilde{\mathbf{D}}/\sigma^2 + \mathbf{V}_\eta^{-1}\right)^{-1}, \qquad \mathbf{d}_\psi \equiv \tilde{\mathbf{D}}'\mathbf{y}/\sigma^2 + \mathbf{V}_\eta^{-1}\mathbf{X}_\pi\boldsymbol{\pi}_0.$$
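A compact sketch of this draw (illustrative names; $\boldsymbol{\pi}$ is then recovered as $\mathbf{H}_\pi^{-1}\boldsymbol{\psi}$):

```python
import numpy as np

def draw_psi(y, D_tilde, X_pi, pi0, V_eta_diag, sigma2):
    """Draw psi = H_pi pi from its Gaussian conditional (illustrative sketch)."""
    prec = D_tilde.T @ D_tilde / sigma2 + np.diag(1.0 / V_eta_diag)   # D_psi^{-1}
    d = D_tilde.T @ y / sigma2 + (X_pi @ pi0) / V_eta_diag            # d_psi
    mean = np.linalg.solve(prec, d)
    L = np.linalg.cholesky(prec)
    return mean + np.linalg.solve(L.T, np.random.standard_normal(len(d)))
```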
  • Step 3: $\boldsymbol{\pi}_0 \mid \gamma_{-\pi_0}, \boldsymbol{\eta}, \mathbf{y}$
From (7), the conditional posterior distribution of the initial conditions $\boldsymbol{\pi}_0$ is proportional to the product of the prior $\mathbf{H}_\pi\boldsymbol{\pi} \sim N(\mathbf{X}_\pi\boldsymbol{\pi}_0, \mathbf{V}_\eta)$ and the prior for $\boldsymbol{\pi}_0$. This implies
$$\boldsymbol{\pi}_0 \mid \gamma_{-\pi_0}, \boldsymbol{\eta}, \mathbf{y} \sim N\!\left(\mathbf{D}_{\pi_0}\mathbf{d}_{\pi_0},\ \mathbf{D}_{\pi_0}\right),$$
where
$$\mathbf{D}_{\pi_0} = \left(\mathbf{X}_\pi'\mathbf{V}_\eta^{-1}\mathbf{X}_\pi + \underline{\mathbf{V}}_{\pi_0}^{-1}\right)^{-1}, \qquad \mathbf{d}_{\pi_0} = \mathbf{X}_\pi'\mathbf{V}_\eta^{-1}\boldsymbol{\psi} + \underline{\mathbf{V}}_{\pi_0}^{-1}\underline{\boldsymbol{\mu}}_{\pi_0}.$$
  • Step 4: $\sigma^2 \mid \gamma_{-\sigma^2}, \boldsymbol{\eta}, \mathbf{y}$
From (7), it is evident that the conditional posterior distribution of the error variance $\sigma^2$ follows an inverse gamma distribution:
$$\sigma^2 \mid \gamma_{-\sigma^2}, \boldsymbol{\eta}, \mathbf{y} \sim IG\!\left(\underline{a}_\sigma + \frac{n}{2},\ \left[\underline{b}_\sigma^{-1} + \tfrac{1}{2}(\mathbf{y} - \mathbf{D}\boldsymbol{\pi})'(\mathbf{y} - \mathbf{D}\boldsymbol{\pi})\right]^{-1}\right).$$
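In code, this draw reduces to sampling the error precision from a gamma distribution and inverting it. The sketch below (illustrative names) follows the $IG(a, b)$ convention used above, in which the updated scale enters as $[b^{-1} + \tfrac{1}{2}\,\mathrm{SSR}]^{-1}$:

```python
import numpy as np

def draw_sigma2(y, D, pi, a_sigma, b_sigma):
    """Draw sigma^2 from its inverse-gamma conditional (illustrative sketch).

    Under this IG(a, b) convention, 1/sigma^2 ~ Gamma(shape=a_post, scale=b_post).
    """
    resid = y - D @ pi
    a_post = a_sigma + 0.5 * len(y)
    b_post = 1.0 / (1.0 / b_sigma + 0.5 * resid @ resid)
    return 1.0 / np.random.gamma(shape=a_post, scale=b_post)
```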
  • Step 5: $\mu \mid \gamma_{-\mu}, \boldsymbol{\eta}, \mathbf{y}$
We begin by writing the prior for $\boldsymbol{\eta}$ in (4) as
$$\mathbf{G}_\rho\boldsymbol{\eta} = \mu\begin{bmatrix} 1 \\ (1 - \rho) \\ \vdots \\ (1 - \rho) \end{bmatrix} + \begin{bmatrix} u_1 \\ u_2 \\ \vdots \\ u_{k_x} \end{bmatrix},$$
where $\mathrm{Var}(u_1) = \tau^2/(1 - \rho^2)$ and $\mathrm{Var}(u_j) = \tau^2$ for $j \geq 2$. The right-hand-side vector $[1\ \ (1 - \rho)\ \cdots\ (1 - \rho)]'$ acts as a covariate vector (with known values given $\rho$), and $\mu$ is its corresponding parameter. One can then show
$$\mu \mid \gamma_{-\mu}, \boldsymbol{\eta}, \mathbf{y} \sim N\!\left(D_\mu d_\mu,\ D_\mu\right),$$
where
$$D_\mu = \left(\tau^{-2}\left[(1 - \rho^2) + (k_x - 1)(1 - \rho)^2\right] + \underline{V}_\mu^{-1}\right)^{-1}, \qquad d_\mu = \mathbf{g}_\rho'\mathbf{G}_\rho\boldsymbol{\eta}/\tau^2 + \underline{V}_\mu^{-1}\underline{\mu}.$$
  • Step 6: $\tau^2 \mid \gamma_{-\tau^2}, \boldsymbol{\eta}, \mathbf{y}$
Defining
$$\tilde{\mathbf{G}}_\rho = \begin{bmatrix} \sqrt{1 - \rho^2} & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix}\mathbf{G}_\rho,$$
the prior in (4) implies
$$\tilde{\mathbf{G}}_\rho(\boldsymbol{\eta} - \boldsymbol{\iota}_{k_x}\mu) \sim N\!\left(\mathbf{0},\ \tau^2\mathbf{I}_{k_x}\right).$$
It follows that
$$\tau^2 \mid \gamma_{-\tau^2}, \boldsymbol{\eta}, \mathbf{y} \sim IG\!\left(\underline{a}_\tau + \frac{k_x}{2},\ \left[\underline{b}_\tau^{-1} + \tfrac{1}{2}(\boldsymbol{\eta} - \boldsymbol{\iota}_{k_x}\mu)'\tilde{\mathbf{G}}_\rho'\tilde{\mathbf{G}}_\rho(\boldsymbol{\eta} - \boldsymbol{\iota}_{k_x}\mu)\right]^{-1}\right).$$
  • Step 7: $\rho \mid \gamma_{-\rho}, \boldsymbol{\eta}, \mathbf{y}$
For $\rho$, the conditional posterior distribution is proportional to
$$p(\rho \mid \gamma_{-\rho}, \boldsymbol{\eta}, \mathbf{y}) \propto p(\rho)\, h(\rho) \prod_{i=2}^{k_x} \phi\!\left(\eta_i - \mu;\ \rho(\eta_{i-1} - \mu),\ \tau^2\right),$$
where the function $h(\rho)$ is defined as
$$h(\rho) = (1 - \rho^2)^{1/2}\exp\!\left(-\frac{1}{2\tau^2}(\eta_1 - \mu)^2(1 - \rho^2)\right).$$
  • Case 1: Continuous Prior
In the continuous case, p ( ρ ) is given in (6). The function h ( ρ ) introduces a non-conjugacy, as the resulting conditional posterior does not take the form of a standard distribution. We will thus use a Metropolis–Hastings (M-H) step to conduct the sampling.
If we were to ignore the contribution of the initial condition through the term $h(\rho)$, the conditional posterior distribution would be
$$\rho \mid \gamma_{-\rho}, \boldsymbol{\eta}, \mathbf{y} \sim TN_{(-1,1)}(D_\rho d_\rho,\ D_\rho), \qquad (14)$$
where
$$D_\rho = \left(\sum_{i=2}^{k_x}(\eta_{i-1} - \mu)^2/\tau^2 + \underline{V}_\rho^{-1}\right)^{-1}, \qquad d_\rho = \sum_{i=2}^{k_x}(\eta_i - \mu)(\eta_{i-1} - \mu)/\tau^2 + \underline{V}_\rho^{-1}\underline{\mu}_\rho.$$
Using (14) as a proposal distribution in an M-H step, we sample a candidate $\rho^c$ from (14) given that our chain is currently at $\rho = \rho^t$. The M-H acceptance probability thus reduces to $h(\rho^c)/h(\rho^t)$. Specifically, we draw $u^* \sim U(0, 1)$ and accept the candidate $\rho^c$ provided
$$u^* < \left(\frac{1 - (\rho^c)^2}{1 - (\rho^t)^2}\right)^{1/2}\exp\!\left(\frac{1}{2\tau^2}(\eta_1 - \mu)^2\left[(\rho^c)^2 - (\rho^t)^2\right]\right).$$
Otherwise, we set the new value of the chain, ρ t + 1 , equal to ρ t .
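A sketch of this M-H update (assuming NumPy/SciPy; names are illustrative and the truncated-normal proposal uses scipy.stats.truncnorm):

```python
import numpy as np
from scipy import stats

def draw_rho_mh(eta, mu, tau2, mu_rho, V_rho, rho_current):
    """One M-H update for rho under the continuous truncated-normal prior (sketch).

    The proposal is the truncated normal in (14), which omits the initial-condition
    term h(rho); the acceptance step corrects for that omission.
    """
    e = eta - mu
    D_rho = 1.0 / (np.sum(e[:-1] ** 2) / tau2 + 1.0 / V_rho)
    d_rho = np.sum(e[1:] * e[:-1]) / tau2 + mu_rho / V_rho
    m, s = D_rho * d_rho, np.sqrt(D_rho)
    a, b = (-1.0 - m) / s, (1.0 - m) / s            # standardized truncation bounds
    rho_cand = stats.truncnorm.rvs(a, b, loc=m, scale=s)

    def log_h(r):
        # log of h(rho) = 0.5*log(1 - rho^2) - (eta_1 - mu)^2 (1 - rho^2) / (2 tau^2)
        return 0.5 * np.log(1.0 - r**2) - 0.5 * (eta[0] - mu) ** 2 * (1.0 - r**2) / tau2

    if np.log(np.random.rand()) < log_h(rho_cand) - log_h(rho_current):
        return rho_cand
    return rho_current
```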
  • Case 2: Discrete Prior
If a discrete prior for $\rho$ is specified, as in (5), the posterior is also discrete with support points $\{\underline{\rho}_j\}_{j=1}^{J}$. Furthermore, the posterior probabilities are given up to proportionality as
$$\Pr(\rho = \underline{\rho}_j \mid \gamma_{-\rho}, \boldsymbol{\eta}, \mathbf{y}) \propto \underline{p}_j\, h(\underline{\rho}_j) \prod_{i=2}^{k_x} \phi\!\left(\eta_i - \mu;\ \underline{\rho}_j(\eta_{i-1} - \mu),\ \tau^2\right).$$
The expression on the right-hand side of the equation above can be computed for all j = 1 , 2 , , J and then normalized by dividing by the sum of all such values. A draw from the corresponding discrete distribution can then be obtained via the method of inversion.
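A sketch of the discrete draw (illustrative names), working with log posterior weights for numerical stability and drawing by inversion of the normalized cumulative probabilities:

```python
import numpy as np

def draw_rho_discrete(eta, mu, tau2, rho_grid, prior_probs):
    """Draw rho from its discrete conditional posterior (illustrative sketch)."""
    e = eta - mu
    log_post = np.empty(len(rho_grid))
    for j, r in enumerate(rho_grid):
        log_h = 0.5 * np.log(1.0 - r**2) - 0.5 * e[0] ** 2 * (1.0 - r**2) / tau2
        resid = e[1:] - r * e[:-1]
        log_lik = -0.5 * np.sum(resid**2) / tau2     # normal kernels; constants cancel
        log_post[j] = np.log(prior_probs[j]) + log_h + log_lik
    probs = np.exp(log_post - log_post.max())
    probs /= probs.sum()
    idx = np.searchsorted(np.cumsum(probs), np.random.rand())   # inverse-CDF draw
    return rho_grid[min(idx, len(rho_grid) - 1)]
```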
  • Step 8: s | γ , η , y
The conditional posterior distribution of each component indicator is proportional to
$$\Pr(s_i = j \mid \gamma, \boldsymbol{\eta}, \mathbf{y}) \propto p_j\, \omega_j^{-1}\, \phi\!\left(\frac{\log\!\left([\psi_i - \mathbf{x}_{i,\pi}\boldsymbol{\pi}_0]^2\right) - m_j - \eta_i}{\omega_j}\right), \quad j = 1, 2, \ldots, 10, \quad i = 1, 2, \ldots, k_x,$$
where $\mathbf{x}_{i,\pi}$ denotes the $i$th row of $\mathbf{X}_\pi$ and $\phi(\cdot)$ denotes the standard normal density function. For a given $i$, the above can be calculated for all $j = 1, 2, \ldots, 10$ and then normalized to obtain a valid discrete distribution. The process is then repeated for all $i$.
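A sketch of this final step (illustrative names; the ten mixture constants from Table 1 of Omori et al. [14] are passed in rather than reproduced here, and a small offset inside the logarithm is an added numerical safeguard):

```python
import numpy as np
from scipy import stats

def draw_indicators(psi, X_pi, pi0, eta, mix_p, mix_m, mix_omega2):
    """Draw the mixture-component indicators s_1, ..., s_{k_x} (illustrative sketch).

    mix_p, mix_m, mix_omega2 : the ten component probabilities, means, and
    variances from Table 1 of Omori et al. [14] (supplied by the caller).
    """
    z = np.log((psi - X_pi @ pi0) ** 2 + 1e-10)     # observed log squared terms
    s = np.empty(len(z), dtype=int)
    for i in range(len(z)):
        # p_j * N(z_i; m_j + eta_i, omega_j^2), normalized over j = 1, ..., 10
        dens = mix_p * stats.norm.pdf(z[i], loc=mix_m + eta[i], scale=np.sqrt(mix_omega2))
        dens /= dens.sum()
        s[i] = np.random.choice(len(dens), p=dens)
    return s
```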

4. Generated Data Experiments

In this section, we perform brief generated data experiments to investigate the performance of the stationary/estimated- ρ adaptive smoothing procedure outlined in Section 2 and Section 3. We seek to illustrate the method’s ability to recover the shape of an unknown regression function and to contrast the behavior of smoothing parameters under this method with those obtained under other assumed processes, including the random walk.
To this end, we begin by generating 1000 observations from a regression model with interesting changes in curvature. Specifically, we first sample $x_i \overset{iid}{\sim} U(-2, 1.5)$, with $U(a, b)$ denoting a uniform distribution on the interval $(a, b)$. To induce some degree of discreteness, consistent with our application of the following section, we round each $x_i$ value to the second decimal place. We then sample $\epsilon_i \overset{iid}{\sim} N(0, 1)$ and generate $y$ as follows:
$$y_i = 1 + 0.5x_i + 2\exp\!\left(-100(x_i + 1)^2\right) + \frac{2}{(3/5)^{1/2}\pi^{1/4}}\exp\!\left(-\frac{25x_i^2}{2}\right)\left(1 - 25x_i^2\right) + \sigma\epsilon_i,$$
with the error standard deviation $\sigma = 0.25$. For our priors, we set $\underline{\boldsymbol{\mu}}_{\pi_0} = [2\ \ 2]'$, $\underline{\mathbf{V}}_{\pi_0} = 1.0 \times 10^5 \mathbf{I}_2$, $a_\sigma = 3$, $b_\sigma = 5/2$, $\underline{\mu} = -10$, $\underline{V}_\mu = 16$, $a_\tau = 3$, and $b_\tau = 0.5$. For the prior for $\rho$, we consider the discrete prior variant, with $\rho \in \{0.5, 0.51, 0.52, \ldots, 0.99\}$. This support is selected to express preference for at least a moderate degree of positive persistence in the smoothing parameter process. Cases with negative $\rho$ may also be considered, although positive persistence is seemingly a natural position to take when modeling regression functions that are believed to be smooth. In experiments not reported here, cases with negative $\rho$ tended to produce immediate and large changes in $\eta$ values and difficult-to-interpret oscillatory smoothing patterns as one departs from the stationary mean $\mu$.
To construct the prior probabilities $\underline{p}_j$ for each element of this discrete vector, we first calculate the ordinate of the standard normal density at $(\underline{\rho}_j - \mu)/\sigma$, where $\mu = 0.8$ and $\sigma = 0.1$. We perform this for each of the discrete $\underline{\rho}_j$ cells. The resulting collection of values is then normalized by dividing by the sum of all such ordinates to produce a proper prior over this discrete vector of points. Of course, a variety of other methods for assigning/constructing the prior probabilities can also be employed.
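A small sketch of this prior construction (illustrative code, not from the paper):

```python
import numpy as np
from scipy import stats

# Discrete support and prior weights for rho, built from standard normal ordinates
rho_grid = np.round(np.arange(0.50, 1.00, 0.01), 2)     # 0.50, 0.51, ..., 0.99
mu_r, sd_r = 0.8, 0.1
ordinates = stats.norm.pdf((rho_grid - mu_r) / sd_r)    # ordinate at (rho_j - mu)/sigma
prior_probs = ordinates / ordinates.sum()               # normalize to a proper prior
```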
Figure 1 provides a summary of the results from this generated data experiment. The top panel of Figure 1 provides a scatterplot of the raw data together with the stationary nonparametric estimates under the employed discrete prior for ρ . As shown in the figure, the method performs very well and accurately recovers the shape of and changes in the true regression function. The bottom panel of Figure 1 plots (in red) posterior means of the smoothing parameters η across each x * value. To gain a sense of how these results might differ from, or be similar to, smoothing results from other approaches (like the random walk), we estimate the model again under two different assumptions regarding the evolution of η . In the first model, we simply fix ρ = 0.3 throughout. In the second model, we consider the random walk specification. Parameters common across approaches are assigned the same priors, with hyperparameters of those priors given previously. Posterior means of the smoothing parameters/log volatilities η are presented for each of these two additional specifications in the lower panel of Figure 1.
The lower panel reveals a few interesting patterns and highlights important differences across approaches. First, results under all three priors show evidence of adaptability: values of $\eta$ are relatively large where the function exhibits the most rapid changes in curvature. Conversely, when $f(x)$ locally behaves as a linear function, $\eta$ values are relatively small, shrinking results toward linearity over the corresponding region as a consequence. Second, we see differences in the degree of adjustment across approaches with different treatments of the AR parameter $\rho$. The smoothing parameters under the random walk model evolve slowly, exhibiting small local changes in $\eta$ values, while results with estimated-$\rho$ and those fixing $\rho = 0.3$ adjust quickly by comparison. Near the functional spike at $x = -1$, for example, all approaches attain their largest value of $\eta$, while the $\rho = 0.3$ results then decline sharply to either side, returning rather quickly to values near the stationary mean $\mu$. The approach with estimated-$\rho$ lies somewhere in between the random walk and $\rho = 0.3$ cases: it resembles overall patterns produced by the random walk model, but adjusts the values of $\eta$ more quickly as one moves away from the peaks in the regression function where curvature changes are most pronounced.
Figure 2 documents how learning takes place in the stationary model regarding the AR coefficient $\rho$ under two different discrete priors. The left panel plots the prior (top left) for the analysis used to generate Figure 1 and the corresponding marginal posterior (bottom left). The prior employed is rather informative regarding the degree of persistence, but learning clearly occurs, as the posterior becomes left-skewed and shifts toward values of $\rho$ near one. In the right panels of Figure 2, we repeat the analysis, this time increasing the generating $\sigma$ value from 0.1 to 1, thus creating a nearly uniform prior for $\rho$ across the discrete cells from 0.5 to 0.99 (top right). The posterior (bottom right) again differs considerably from the prior, and places even more of its mass near the unit root, suggesting preference for the random walk specification (or a close approximation to it) for this experiment. Mean absolute error (MAE), for example (letting $f(x)$ denote the true value of the function in the experiment and $\hat{f}(x)$ the estimated value (i.e., the posterior mean), we calculate the MAE as $k_x^{-1}\sum_{i=1}^{k_x}\left|\hat{f}(x_i^*) - f(x_i^*)\right|$), was found to be 0.029 and 0.030 for the random walk and estimated-$\rho$ specifications, respectively, and 0.037 when fixing $\rho = 0.3$. The movement of the posterior of $\rho$ toward the unit root is consistent with this overall preference for the random walk model in this experiment.
We consider an additional set of experiments, this time generating data from a linear model with f ( x ) = 1 + 0.5 x . With the exception of ρ , we continue to employ the same priors as those just used in the nonlinear experiment. For ρ , we again employ a discrete prior, this time over the support points ρ { 0.3 , 0.31 , 0.32 , , 0.99 } , and with discrete prior probabilities p ̲ j generated in a similar manner, with μ = 0.8 and σ = 0.2 . Results from this experiment are provided in Figure 3.
In the top left panel of Figure 3, we again see that the method performs admirably in recovering the simple linear regression function. In the bottom left panel, we compare posterior means of the smoothing parameters $\eta$ for both the random walk and estimated-$\rho$ cases. Here, we see that the random walk specification again slowly evolves, gradually moving toward smaller values of $\eta$, which further shrink the vector $\boldsymbol{\pi}$ toward linearity. Values of $\eta$ are largest for small $x$ in the random walk specification and then decline, owing to the influence of the $N(-10, 16)$ prior for the initial smoothing parameter $\eta_1$. This prior is chosen to match the prior employed for its counterpart, the stationary mean $\mu$, in the estimated-$\rho$ case. The relatively slowly adapting random walk specification takes some time to move away from the influence of this initial condition, whereas the estimated-$\rho$ model immediately jumps to the stationary mean $\mu$ and remains there across the support of $x$.
Posterior means of η under the estimated- ρ model are uniformly smaller than those of the random walk model and thus more strongly impose linearity—an appropriate and desirable outcome for this example. The right panels of Figure 3 provide additional results, as they plot prior (top) and posterior (bottom) probabilities of ρ over its discrete support. Here, we again see that learning has taken place, as the marginal prior and posterior distributions are clearly different. Specifically, the posterior is observed to move toward interior ρ values and to place little mass around ρ 1 , unlike our previous nonlinear example of Figure 1 where the posterior moved toward the unit root. Finally, we note that the MAE for the estimated- ρ model was 0.005, while that of the random walk specification was 0.012. Taken together, we see that the linear and nonlinear experiments considered here provide examples where learning about ρ clearly occurs. This learning leads to posterior shifts in opposite directions in our two experiments, although both movements are in directions leading to improved fits.

5. Application

We illustrate our methods using data from the National Longitudinal Survey of Youth (NLSY) 1979. The NLSY is a nationally representative sample of young men and women, all born between 1957 and 1964, containing a wealth of demographic, academic, and labor market data on its participants. We use these data to better understand how behavioral problems among adolescents affect their academic achievement, the latter measured through a series of test scores within the NLSY. Specifically, the nonparametric methods derived in Section 2 and Section 3 are employed to flexibly explore the relationship between behavioral problem severity and academic achievement. Before discussing the results of this investigation, we first describe the variables analyzed in more detail.
The NLSY contains scores from the Peabody Individual Achievement Test (PIAT) in both mathematics and reading. The specific variables we extract and separately analyze are the total standard score in mathematics (MATHZ2000) (these parenthetical designations give the variable names in the NLSY), total standard score in reading recognition (RECOGZ2000), and total standard score in reading comprehension (COMPZ2000). Test scores range from 65 to 135 for each of the three tests in our sample, with sample means (and standard deviations) equal to 106.8 (13.5), 109.0 (14.2), and 105.0 (13.5) for math, reading recognition, and reading comprehension, respectively.
The NLSY also contains a supplemental mother’s file, in which mothers of the NLSY participants are asked to respond to a series of questions regarding their children. Included among responses in the mother’s file are a set of 28 questions surrounding specific behaviors that their children may or may not have demonstrated over the preceding three months. These questions touch on characteristics such as antisocial behavior, anxiousness/depression, headstrongness, hyperactivity, immature dependency, and peer conflict/social withdrawal (see https://nlsinfo.org/content/cohorts/nlsy79-children/topical-guide/assessments/behavior-problems-index-bpi (accessed on February 2025) for additional details). In answering these questions, mothers categorize the stated behaviors as occurring “often”, “sometimes”, or “not” (true). An overall index (BPIZX2000) is then constructed in the NLSY (which we denote as the BPI) by aggregating assigned scores across questions, with higher values of the index indicative of higher degrees of behavioral issues. In our final sample, the BPI ranges from a low of 74 to a high of 145, creating $k_x = 72$ distinct values when constructing $\mathbf{D}$ as in Section 2. Of these 72 distinct values, nine cells are not represented in the sample, three contain exactly one observation, and no cells contain more than 44 observations. When cells are not empty, the average number of observations associated with each distinct BPI value is approximately 16.3. A seeming advantage of our approach is that our smoothness prior allows us to handle cells that are either infrequently represented or not represented at all; the prior borrows information from neighboring outcomes in forming posterior predictions at those values. Our final sample analyzed consists of $n = 1076$ observations, and each test score outcome is considered in a separate analysis. For our priors, we employ the continuous prior for $\rho$, setting $\underline{\mu}_\rho = 0.8$ and $\underline{V}_\rho = 0.15^2$. As discussed in Section 3, posterior simulation thus employs a Metropolis–Hastings step for sampling $\rho$, and associated acceptance rates were very high, equaling 0.91, 0.91, and 0.89 for mathematics, reading recognition, and reading comprehension, respectively. For the remaining hyperparameters, we set $a_\sigma = 3$, $b_\sigma = 0.005$, $\underline{\mu}_0 = -3$, $\underline{V}_{\mu_0} = 4$, $a_\tau = 3$, and $b_\tau = 0.5$. Each posterior simulator is run for 82,500 iterations, with the first 2500 discarded as the burn-in period.
The results of our analysis are summarized in Figure 4. The upper left panel of the figure provides plots of the nonparametric conditional mean estimates $\widehat{E(\mathrm{Score} \mid \mathrm{BPI})} = E(\boldsymbol{\pi} \mid \mathbf{y})$ for all three tests (solid) along with estimates from simple linear models (dashed). The remaining three panels of the figure provide more detailed results for each test score outcome. These panels plot nonparametric estimates (black, solid), linear estimates (black, dashed), posterior regions associated with the nonparametric estimates (light blue, shaded) (these are calculated as the posterior mean of each element of $\boldsymbol{\pi}$ $\pm$ 1.96 times the corresponding posterior standard deviation), and dummy variable coefficient estimates (red, dashed). The latter are obtained from OLS regressions of the given test score on a series of indicator variables for all BPI scores observed in the sample.
The upper left panel of the figure reveals several interesting results. First, the nonparametric estimates are suggestive of a pattern of nonlinearity that is consistent across all tests. Specifically, each conditional mean function is somewhat flat, though declining, for low values of the BPI, and then turns more sharply negative once the BPI exceeds a value of approximately 90. Despite this initial near-flatness, posterior means are found to be monotonically decreasing in the BPI across the full BPI support for all of our tests. This clearly suggests that behavioral issues have a globally negative effect on student achievement. When comparing the nonparametric results against linear estimates, another consistent pattern emerges: linear point estimates tend to overstate expected achievement at both tails of the BPI distribution, and this is particularly true at high values of the BPI for the reading recognition test score. Though not reported in the figure, we also note that the marginal posterior distributions of the autoregressive coefficient $\rho$ are similar across tests, with posterior means (and standard deviations) equal to 0.78 (0.132), 0.78 (0.134), and 0.79 (0.139) for mathematics, reading recognition, and reading comprehension, respectively.
The remaining three panels of the figure reveal the value of smoothing for this application. Dummy variable coefficient estimates—which may be interpreted as the starting point for the smoothed analysis—are highly variable, making it difficult to clearly identify overall patterns in the data. The nonparametric estimates clearly cut through the middle of this erratic collection and produce results that are easily interpretable by comparison.
Figure 4 also reveals that estimates from the linear specifications do fall within our nonparametric posterior regions for all tests, as there is considerable estimation uncertainty at the right tail of the BPI distribution. We therefore consider an additional model, guided by the exploratory nonparametric analysis, to further investigate the potential existence of nonlinearities in the test score/BPI relationships. Specifically, we estimate a linear spline with a single unknown changepoint. Such an analysis will allow us to determine the most likely location for a change in curvature (slope) if one is present, along with values of the slopes themselves both before and after the change. The model we take to the data is of the form
$$y_i = \beta_0 + \beta_1 BPI_i + \beta_2 (BPI_i - \lambda)_{+} + \epsilon_i, \qquad (16)$$
for each of our three test score outcomes. In Equation (16), $\lambda$ denotes our single changepoint, the notation $x_+ \equiv \max\{x, 0\}$, and $\epsilon_i \overset{iid}{\sim} N(0, \sigma^2)$. It is clear that the spline function permits a change in slope at the value $\lambda$: the slope of our line is $\beta_1$ when $BPI \leq \lambda$ and $\beta_1 + \beta_2$ when $BPI > \lambda$. We also place a uniform prior across integer-valued $\lambda$ in the set $\{79, 80, 81, \ldots, 130\}$, thus allowing at least a modest number of observations in each of the two slope regimes. For our priors, we choose
$$\begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{bmatrix} \sim N\!\left(\begin{bmatrix} 100 \\ 0 \\ 0 \end{bmatrix},\ \begin{bmatrix} 100 & 0 & 0 \\ 0 & 100 & 0 \\ 0 & 0 & 100 \end{bmatrix}\right) \quad \text{and} \quad \sigma^2 \sim IG(3, 0.005).$$
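As a sketch of the pieces underlying this model (illustrative names, not the paper's code), the design matrix $[1,\ BPI,\ (BPI - \lambda)_+]$ and the log conditional posterior of $\lambda$ over its integer grid, which can be normalized to draw $\lambda$ given the remaining parameters within the sampler:

```python
import numpy as np

def spline_design(bpi, lam):
    """Design matrix [1, BPI, (BPI - lambda)_+] for the single-changepoint spline."""
    bpi = np.asarray(bpi, dtype=float)
    return np.column_stack([np.ones_like(bpi), bpi, np.maximum(bpi - lam, 0.0)])

def logpost_lambda(y, bpi, beta, sigma2, lam_grid):
    """Log conditional posterior of lambda over its grid under a uniform prior (sketch)."""
    out = np.empty(len(lam_grid))
    for k, lam in enumerate(lam_grid):
        resid = y - spline_design(bpi, lam) @ beta
        out[k] = -0.5 * np.sum(resid**2) / sigma2
    return out

lam_grid = np.arange(79, 131)    # uniform prior over {79, 80, ..., 130}
```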
The Gibbs sampler is employed to fit the model under these priors, running each posterior simulator for 110,000 iterations and discarding the first 10,000 as the burn-in period. Chan et al. [11] (Exercise 12.10) provide detailed steps for posterior simulation in a model of this form, and we follow those steps here.
If our earlier interpretations of the initial “flatness” of the regression functions followed by a more precipitous decline are correct, we should see those patterns echoed in the changepoint specification. Specifically, we expect to find (a) a revision of our uniform prior for the changepoint $\lambda$ to a posterior for $\lambda$ with a mode near BPI = 90 and (b) evidence that the slope of our regression function is more negative after the changepoint than before, suggesting that increments to the BPI have more substantial negative impacts on student achievement as one moves into the right tail of the BPI distribution.
Our estimation results provide confirmation of these patterns. Figure 5 plots posterior frequencies of the changepoint $\lambda$ for each of our three test score outcomes. Interestingly, the posterior mode of the changepoint in each case is exactly 91, consistent with Figure 4, which shows a relatively flat function until the BPI ≈ 90, followed by a more pronounced decline.
Table 1 provides posterior means and standard deviations for the regression coefficients β and variance parameter σ 2 in the changepoint model. As shown in the table, β 1 —which is interpreted as the pre-changepoint slope—has a posterior mean near zero and a correspondingly large posterior standard deviation. This, again, is consistent with an initial period of relative flatness in the relationship between the BPI and test scores. The regression coefficient β 2 , however, which is interpreted as the change in slope after λ , has consistently negative posterior means across test scores and places most of its mass over negative values: the posterior probability that β 2 < 0 is 0.93, 0.98, and 0.92 across math, reading recognition, and reading comprehension scores, respectively. These results provide additional confirmation that further movements into the right tail of the behavioral problems index distribution produce the largest negative impacts on student achievement.
We conclude with a comparison of our Bayesian nonparametric results and those obtained via frequentist kernel regression. Specifically, we additionally employ the Nadaraya–Watson (Nadaraya [18] and Watson [19]) or constant kernel estimator to estimate our regression relationships nonparametrically:
$$\widehat{E(Score \mid BPI = BPI_0)} = \frac{\sum_{i=1}^{n} Score_i\, K\!\left(\dfrac{BPI_i - BPI_0}{h}\right)}{\sum_{i=1}^{n} K\!\left(\dfrac{BPI_i - BPI_0}{h}\right)},$$
where a Gaussian kernel K is selected and the bandwidth/global smoothing parameter h is set equal to five. The results are presented in Figure 6, with analysis again conducted separately for each test score outcome.
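For reference, a minimal sketch of this kernel estimator (illustrative code, not from the paper):

```python
import numpy as np

def nadaraya_watson(score, bpi, grid, h=5.0):
    """Nadaraya-Watson estimate of E(Score | BPI) with a Gaussian kernel (sketch).

    `grid` holds the BPI values at which the conditional mean is evaluated.
    """
    score, bpi, grid = map(np.asarray, (score, bpi, grid))
    u = (bpi[None, :] - grid[:, None]) / h        # (len(grid) x n) kernel arguments
    K = np.exp(-0.5 * u**2)                       # Gaussian kernel (constants cancel)
    return (K @ score) / K.sum(axis=1)
```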
Table 1. Coefficient posterior means (and standard deviations below) from changepoint model.

              Math        Reading Rec.    Reading Comp.
  β0         108.08          108.62          106.16
             (8.72)          (9.20)          (9.02)
  β1          0.008           0.029           0.014
             (0.106)         (0.113)         (0.110)
  β2         −0.217          −0.323          −0.204
             (0.182)         (0.183)         (0.197)
  σ2         178.27          195.84          176.92
             (7.88)          (8.65)          (7.82)
As shown in the figure, point estimates from the Bayesian and Nadaraya–Watson approaches are similar, but depart at the right tail of the BPI distribution, particularly for the mathematics and reading comprehension test score outcomes. These departures both illustrate the potential value of adaptive smoothing and reveal limitations when using a constant global smoothing parameter. In our sample, only 1.46% ( n = 15 ) of observations contain a BPI in excess of 130, and thus considerable uncertainty exists when estimating the function over this region. A constant bandwidth h = 5 is chosen within our kernel regression to suitably smooth the functions over small/moderate BPI values, while this same level of smoothing also implies that just a few noisy data points are used to estimate the functions at the largest BPI values. This often implies a rather erratic (and seemingly implausible) kernel-based point estimate when the BPI > 130. In contrast, the Bayesian adaptive procedure imposes a higher degree of smoothing over this region and leads to a continuation of the overall downward trend present in the figures for smaller values of the BPI.

6. Conclusions

A new scheme for nonparametric Bayesian estimation of regression functions was presented. An efficient Gibbs sampling algorithm was developed that involves sampling from standard distributions, permits local adaptability in the degree of smoothing, and imposes stationarity in the smoothing parameter process. The algorithm was shown to perform well in generated data experiments, and its value was illustrated in an application relating a behavioral problems index (BPI) to student achievement. Analysis of those data showed that the test score/BPI relationship is relatively flat/slightly negative when behavioral problems are low, while marginal increments to the BPI have a clearly negative effect on student achievement in math and reading as one moves into the right tail of the BPI distribution.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study can be accessed from https://www.nlsinfo.org/investigator/pages/login (accessed on October 2024).

Acknowledgments

The author is grateful for the research assistance provided by Yaling Qi in acquiring this data.

Conflicts of Interest

The author declares no conflicts of interest.

References

  1. Smith, M.; Kohn, R. Nonparametric Regression Using Bayesian Variable Selection. J. Econom. 1996, 75, 317–343. [Google Scholar] [CrossRef]
  2. Kohn, R.; Smith, M.; Chan, D. Nonparametric Regression Using Linear Combinations of Basis Functions. Stat. Comput. 2001, 11, 313–322. [Google Scholar] [CrossRef]
  3. Chib, S.; Greenberg, E.; Simoni, A. Nonparametric Bayes Analysis of the Sharp and Fuzzy Regression Discontinuity Designs. Econom. Theory 2023, 39, 481–533. [Google Scholar]
  4. Rasmussen, C.E.; Williams, C.K.I. Gaussian Processes for Machine Learning; MIT Press: Cambridge, MA, USA, 2006. [Google Scholar]
  5. Dixon, M.; Halperin, I.; Bilokon, P. Machine Learning in Finance; Springer: Berlin/Heidelberg, Germany, 2020. [Google Scholar]
  6. Koop, G.; Poirier, D. Bayesian Variants of Some Classical Semiparametric Regression Techniques. J. Econom. 2004, 123, 259–282. [Google Scholar] [CrossRef]
  7. Koop, G.; Tobias, J. Semiparametric Bayesian Inference in Smooth Coefficient Models. J. Econom. 2006, 134, 283–315. [Google Scholar] [CrossRef]
  8. Chib, S.; Jeliazkov, I. Inference in Semiparametric Dynamic Models for Binary Longitudinal Data. J. Am. Stat. Assoc. 2006, 101, 685–700. [Google Scholar] [CrossRef]
  9. Chan, J.; Tobias, J. An Alternate Parameterization for Bayesian Nonparametric/Semiparametric Regression. In Advances in Econometrics: Topics in Identification, Limited Dependent Variables, Partial Observability, Experimentation, and Flexible Modeling; Emerald Insight: Leeds, UK, 2019; Volume 40B, pp. 47–64. [Google Scholar]
  10. Tobias, J.; Bond, T. Semiparametric Bayesian Estimation in an Ordinal Probit Model with Application to Life Satisfaction Across Countries, Age and Gender. J. Econom. 2025. in process. [Google Scholar] [CrossRef]
  11. Chan, J.; Koop, G.; Poirier, D.; Tobias, J. Bayesian Econometric Methods, 2nd ed.; Cambridge University Press: Cambridge, UK, 2019; p. 484. [Google Scholar]
  12. Kline, B.; Tobias, J. The Wages of BMI: Bayesian Analysis of a Skewed Treatment Response Model with Nonparametric Endogeneity. J. Appl. Econom. 2008, 23, 767–793. [Google Scholar] [CrossRef]
  13. Chib, S.; Greenberg, E.; Jeliazkov, I. Estimation of Semiparametric Models in the Presence of Endogeneity and Sample Selection. J. Comput. Graph. Stat. 2009, 18, 321–348. [Google Scholar]
  14. Omori, Y.; Chib, S.; Shephard, N.; Nakajima, J. Stochastic Volatility with Leverage: Fast and Efficient Likelihood Inference. J. Econom. 2007, 140, 425–449. [Google Scholar] [CrossRef]
  15. Del Negro, M.; Primiceri, G. Time-Varying Structural Vector Autoregressions and Monetary Policy: A Corrigendum. Rev. Econ. Stud. 2015, 1342–1345. [Google Scholar] [CrossRef]
  16. Chan, J.C. Comparing Stochastic Volatility Specifications for Large Bayesian VARs. J. Econom. 2023, 235, 1419–1446. [Google Scholar]
  17. Kim, S.; Shephard, N.; Chib, S. Stochastic Volatility: Likelihood Inference and Comparison with ARCH Models. Rev. Econ. Stud. 1998, 65, 361–393. [Google Scholar]
  18. Nadaraya, E.A. On Estimating Regression. Theory Probab. Its Appl. 1964, 9, 141–142. [Google Scholar] [CrossRef]
  19. Watson, G.S. Smooth Regression Analysis. Sankhya Ser. A 1964, 26, 359–372. [Google Scholar]
Figure 1. Regression function and smoothing parameter posterior means: nonlinear example.
Figure 2. Prior and posterior distributions of $\rho$: $\mu = 0.8$, $\sigma = 0.1$ prior (left panels) and $\mu = 0.8$, $\sigma = 1$ prior (right panels).
Figure 3. Regression function, smoothing parameter, and AR coefficient results: linear example.
Figure 4. Estimation results from nonparametric regressions of achievement on BPI.
Figure 5. Posterior frequencies of changepoint $\lambda$.
Figure 6. Bayesian and Nadaraya–Watson estimates of achievement/BPI relationships.