1. Introduction
While there exists a considerable literature on bandwidth selection for kernel-based nonparametric density and regression estimation, the problem of nonparametric prediction has largely been ignored. To our knowledge, no such selection method exists, despite the relevance and frequency of such prediction problems in practice. They include, for example, any situation in which you want to predict counterfactuals, as in impact evaluation (also known as treatment effect estimation). Other examples are statistical matching or data matching (see [1,2,3], and references therein), the imputation of missing values (see e.g. [4,5,6], and references therein), or the simulation of scenarios. Note that we are not thinking of extrapolation far outside the support of the observed covariates, a problem that would go beyond the ones described here, see [7]. Nor do we refer to bandwidth selection in stationary time series; in that context, various bandwidth and other model selection methods have been developed, see e.g. the review of Antoniadis [8] or [9].
All these situations have the following three features in common: you can think of a regression model with Y being the left-hand and X the observed right-hand variables. You have one sample, denoted 'source', in which both are given, so that you can conduct a nonparametric regression. At the same time you have, or simulate, another sample or population, denoted 'target', for which the same (as for 'source') potential response Y is not observed. The basic assumption is that the dependence structure, or in our case the conditional expectation of Y given X, is the same in both populations. In data matching, and similarly when imputing missing values, the Y were not sampled for the target sample; in scenarios, the X of the target refer to an artificial, maybe future, population for which we simply cannot observe any Y; in counterfactual exercises you typically have Y observed for the target sample, but under a different situation, called 'treatment'. Then you use the source sample to impute the potential Y of the target group for the situation 'without treatment'. The difference between the observed Y (under treatment) and the imputed Y (without treatment) gives the so-called 'treatment effect for the treated'.
Our proposal relies on the so-called smooth bootstrap approach, see [10]. That is, you aim to draw bootstrap samples from a nonparametric pre-estimate of the joint distribution of $(X,Y)$. For the original source sample, and for each bootstrap sample, you estimate the regression function nonparametrically. These estimates allow us to approximate the mean squared error of the prediction for any $x$ inside the support of $X$. Finally, you average these errors over the covariate values observed in the target sample. We said 'you aim' because it can be shown that there exists a closed analytical form for the resulting MASE estimate. This simplifies the procedure drastically, making it quite attractive in practice. One may argue that the exactness of this MASE approximation hinges a lot upon the pre-estimate. Yet, for finding the optimal bandwidth (or model) it suffices that our MASE approximation takes its minimum at the same bandwidth as the true but unknown MASE. Our simulation studies show that this is actually the case. This work is collected in [11].
2. The Bandwidth Selection Method
Suppose we are provided with a complete sample $\{(X_i^s, Y_i^s)\}_{i=1}^{n_s}$ from the source population, with $X_i^s \sim f^s$ and $E[Y^s \mid X^s = x] = m(x)$. For the target population we are only provided with observations $X_j^t$, $j = 1, \ldots, n_t$, from a density $f^t$ which is potentially different from $f^s$. We are interested in predicting the expected response in the target population assuming that $E[Y^t \mid X^t = x] = m(x)$, or, equivalently, in estimating $m(x)$ along the target covariates. Moreover, if some 'outcomes' are observed for the target population, their conditional expectation is supposed to differ from $m(x)$; recall our example of outcome under treatment vs. without, or see our application where $m(x)$ is the expected wage given $x$ if you were a man.
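To fix ideas, the source/target setup can be sketched with simulated data; the distributions, sample sizes, and the regression function used below are illustrative assumptions, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)

# Source population: both covariate X^s and response Y^s are observed.
n_s = 200
X_s = rng.uniform(0, 3, n_s)                   # X^s ~ f^s (uniform, for illustration)
Y_s = np.sin(X_s) + rng.normal(0, 0.2, n_s)    # Y^s = m(X^s) + noise, with m = sin here

# Target population: only the covariates X^t are available, drawn from a
# different density f^t on the same support.
n_t = 100
X_t = rng.beta(2, 4, n_t) * 3                  # f^t != f^s

# Goal: predict m(X_j^t) = E[Y | X = X_j^t] using only the source sample.
```

The point is that no response is ever observed alongside `X_t`; the source sample alone has to deliver the prediction rule.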
For the prediction we have to estimate $m(\cdot)$ by a Nadaraya-Watson estimator
$$\hat m_h(x) = \frac{\sum_{i=1}^{n} K_h(x - X_i)\, Y_i}{\sum_{i=1}^{n} K_h(x - X_i)}, \qquad K_h(u) = \frac{1}{h} K\!\left(\frac{u}{h}\right),$$
with kernel $K$ and bandwidth $h$. Let us suppress the hyper-indices for a moment, thinking for now always of the source sample with $Y$ observed. The challenge is to find a bandwidth $h$ which is MASE optimal for our prediction problem. The point-wise MSE, and afterwards the MASE, are approximated by their bootstrap versions obtained as follows. Imagine $\{(X_i^*, Y_i^*)\}_{i=1}^{n}$ are bootstrap samples drawn from the kernel density estimate $\hat f_g$ with bandwidth $g$. Then we get
$$MSE^*(x) = E^*\!\left[\left(\hat m_h^*(x) - \hat m_g(x)\right)^2\right],$$
where $(X^*, Y^*)$ has bootstrap marginal density $\hat f_g$, $\hat m_h^*$ denotes the Nadaraya-Watson estimator computed from the bootstrap sample, and $\hat m_g(x) = E^*[Y^* \mid X^* = x]$. Clearly, this is the bootstrap analogue to
$$MSE(x) = E\!\left[\left(\hat m_h(x) - m(x)\right)^2\right].$$
In order to compute the MASE we need to carefully distinguish between source and target sample, and therefore have to use the hyper-indices again. For finding a globally optimal bandwidth $h$ we would like to minimise
$$MASE(h) = E^s\!\left[\frac{1}{n_t}\sum_{j=1}^{n_t}\left(\hat m_h^s(X_j^t) - m(X_j^t)\right)^2\right],$$
where $E^s$ refers to the expectation in the source population. In the bootstrap world we have
$$MASE^*(h) = \frac{1}{n_t}\sum_{j=1}^{n_t} E^*\!\left[\left(\hat m_h^{s*}(X_j^t) - \hat m_g^s(X_j^t)\right)^2\right].$$
A bootstrap bandwidth selector for prediction is defined as
$$\hat h^* = \arg\min_{h>0} MASE^*(h).$$
Note that the computation of the selected bandwidth requires neither a Monte Carlo approximation nor a nonparametric estimate of the density of the target population.
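Putting the pieces together, the selector $\hat h^* = \arg\min_h MASE^*(h)$ can be illustrated by a grid search. This sketch deliberately uses a Monte Carlo approximation of $MASE^*$ for transparency, even though, as noted above, the closed-form expression makes such simulation unnecessary in practice; the kernel, the grid, and the pilot bandwidth $g$ are illustrative assumptions:

```python
import numpy as np

def nadaraya_watson(x, X, Y, h):
    """Nadaraya-Watson estimate with Gaussian kernel and bandwidth h."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    W = np.exp(-0.5 * ((x[:, None] - X[None, :]) / h) ** 2)
    return W @ Y / W.sum(axis=1)

def mase_star(h, Xs, Ys, Xt, g, B, rng):
    """MASE*(h): bootstrap MSE of m*_h against the pilot m_g,
    averaged over the covariates X_j^t observed in the target sample."""
    n = len(Xs)
    pilot = nadaraya_watson(Xt, Xs, Ys, g)
    total = 0.0
    for _ in range(B):
        idx = rng.integers(0, n, n)
        X_star = Xs[idx] + g * rng.standard_normal(n)   # smooth bootstrap draw
        Y_star = Ys[idx] + g * rng.standard_normal(n)
        total += np.mean((nadaraya_watson(Xt, X_star, Y_star, h) - pilot) ** 2)
    return total / B

rng = np.random.default_rng(1)
Xs = rng.uniform(0, 3, 150)
Ys = np.sin(Xs) + rng.normal(0, 0.2, 150)
Xt = rng.beta(2, 4, 80) * 3                             # target covariates only

grid = np.linspace(0.05, 1.0, 20)
h_hat = grid[np.argmin([mase_star(h, Xs, Ys, Xt, g=0.25, B=50, rng=rng)
                        for h in grid])]
print(h_hat)
```

Because the criterion averages over the target covariates rather than the source ones, regions of the support that matter for the prediction receive the weight that the target design assigns to them.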