1. Introduction
While there exists a considerable literature on bandwidth selection for kernel-based nonparametric density and regression estimation, the problem of nonparametric prediction has largely been ignored. To our knowledge, no such selection method exists, despite the relevance and frequency of such prediction problems in practice. They include, for example, any situation in which you want to predict counterfactuals, as in impact evaluation (also known as treatment effect estimation). Other examples are statistical matching or data matching (see [1,2,3], and references therein), the imputation of missing values (see e.g. [4,5,6], and references therein), or the simulation of scenarios. Note that we are not thinking of extrapolation far outside the support of the observed covariates, a problem that would go beyond the ones described here, see [7]. Nor do we refer to bandwidth selection in stationary time series; in that context, various bandwidth and other model selection methods have been developed, see e.g. the review of Antoniadis [8] or [9].
All these situations have the following three features in common: you can think of a regression model with Y being the left-hand and X the observed right-hand variables. You have one sample, denoted 'source', in which both are given, so that you can conduct a nonparametric regression. At the same time you have, or simulate, another sample or population, denoted 'target', for which the same (as for 'source') potential response Y is not observed. The basic assumption is that the dependence structure, or in our case the conditional expectation of Y given X, is the same in both populations. In data matching, and similarly when imputing missing values, the Y were not sampled for the target sample; in scenarios, the X of the target refer to an artificial, maybe future, population for which we simply cannot observe any Y; in counterfactual exercises you typically have Y observed for the target sample, but under a different situation, called 'treatment'. Then you use the source sample to impute the potential Y of the target group for the situation 'without treatment'. The difference between the observed Y (under treatment) and the imputed Y (without treatment) gives the so-called 'treatment effect for the treated'.
Our proposal relies on the so-called smooth bootstrap approach, see [10]. That is, you aim to draw bootstrap samples from a nonparametric pre-estimate of the joint distribution of $(X,Y)$. For the original source sample, and for each bootstrap sample, you estimate the regression function nonparametrically. These estimates allow us to approximate the mean squared error of the prediction for any $x$ inside the support of $X$. Finally, you average these errors over the covariate values observed in the target sample. We said 'you aim' because it can be shown that there exists a closed analytical form for the resulting MASE estimate. This simplifies the procedure drastically, making it quite attractive in practice. One may argue that the exactness of this MASE approximation hinges a lot upon the pre-estimate. Yet, for finding the optimal bandwidth (or model) it suffices that our MASE approximation takes its minimum at the same bandwidth as the true but unknown MASE. Our simulation studies show that this is actually the case. This work is collected in [11].
2. The Bandwidth Selection Method
Suppose we are provided with a complete sample $\{(X_i^s, Y_i^s)\}_{i=1}^{n_s}$ from the source population, with $X_i^s \sim f^s$ and $E[Y^s \mid X^s = x] = m(x)$. For the target population we are only provided with observations $X_j^t$, $j = 1, \ldots, n_t$, from a density $f^t$ which is potentially different from $f^s$. We are interested in predicting the expected response in the target population assuming that $E[Y^t \mid X^t = x] = m(x)$, or, equivalently, in estimating $m(x)$ along the target covariates. Moreover, if some 'outcomes' are observed for the target population, their conditional expectation is supposed to differ from $m(x)$; recall our example of outcome under treatment vs. without, or see our application where $m(x)$ is the expected wage given $x$ if you were a man.
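To fix ideas, the source/target setup can be sketched with simulated data; the distributions, sample sizes, and the regression function used below are illustrative assumptions, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)

# Source population: both covariate X^s and response Y^s are observed.
n_s = 200
X_s = rng.uniform(0, 3, n_s)                   # X^s ~ f^s (uniform, for illustration)
Y_s = np.sin(X_s) + rng.normal(0, 0.2, n_s)    # Y^s = m(X^s) + noise, with m = sin here

# Target population: only the covariates X^t are available, drawn from a
# different density f^t on the same support.
n_t = 100
X_t = rng.beta(2, 4, n_t) * 3                  # f^t != f^s

# Goal: predict m(X_j^t) = E[Y | X = X_j^t] using only the source sample.
```

The point is that no response is ever observed alongside `X_t`; the source sample alone has to deliver the prediction rule.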
For the prediction we have to estimate $m(\cdot)$ by a Nadaraya-Watson estimator
$$\hat m_h(x) = \frac{\sum_{i=1}^{n} K_h(x - X_i)\, Y_i}{\sum_{i=1}^{n} K_h(x - X_i)}, \qquad K_h(u) = \frac{1}{h} K\!\left(\frac{u}{h}\right),$$
with kernel $K$ and bandwidth $h$. Let us suppress the hyper-indices for a moment, thinking for now always of the source sample with $Y$ observed. The challenge is to find a bandwidth $h$ which is MASE optimal for our prediction problem. The point-wise MSE, and afterwards the MASE, are approximated by their bootstrap versions obtained as follows. Imagine $\{(X_i^*, Y_i^*)\}_{i=1}^{n}$ are bootstrap samples drawn from the kernel density estimate $\hat f_g$ with bandwidth $g$. Then we get
$$MSE^*(x) = E^*\!\left[\left(\hat m_h^*(x) - \hat m_g(x)\right)^2\right],$$
where $(X^*, Y^*)$ has bootstrap marginal density $\hat f_g$, $\hat m_h^*$ denotes the Nadaraya-Watson estimator computed from the bootstrap sample, and $\hat m_g(x) = E^*[Y^* \mid X^* = x]$. Clearly, this is the bootstrap analogue to
$$MSE(x) = E\!\left[\left(\hat m_h(x) - m(x)\right)^2\right].$$
In order to compute the MASE we need to carefully distinguish between source and target sample, and therefore have to use the hyper-indices again. For finding a globally optimal bandwidth $h$ we would like to minimise
$$MASE(h) = E^s\!\left[\frac{1}{n_t}\sum_{j=1}^{n_t}\left(\hat m_h^s(X_j^t) - m(X_j^t)\right)^2\right],$$
where $E^s$ refers to the expectation in the source population. In the bootstrap world we have
$$MASE^*(h) = \frac{1}{n_t}\sum_{j=1}^{n_t} E^*\!\left[\left(\hat m_h^{s*}(X_j^t) - \hat m_g^s(X_j^t)\right)^2\right].$$
A bootstrap bandwidth selector for prediction is defined as
$$\hat h^* = \arg\min_{h>0} MASE^*(h).$$
Note that the computation of the selected bandwidth requires neither a Monte Carlo approximation nor a nonparametric estimate of the density of the target population.
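Putting the pieces together, the selector $\hat h^* = \arg\min_h MASE^*(h)$ can be illustrated by a grid search. This sketch deliberately uses a Monte Carlo approximation of $MASE^*$ for transparency, even though, as noted above, the closed-form expression makes such simulation unnecessary in practice; the kernel, the grid, and the pilot bandwidth $g$ are illustrative assumptions:

```python
import numpy as np

def nadaraya_watson(x, X, Y, h):
    """Nadaraya-Watson estimate with Gaussian kernel and bandwidth h."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    W = np.exp(-0.5 * ((x[:, None] - X[None, :]) / h) ** 2)
    return W @ Y / W.sum(axis=1)

def mase_star(h, Xs, Ys, Xt, g, B, rng):
    """MASE*(h): bootstrap MSE of m*_h against the pilot m_g,
    averaged over the covariates X_j^t observed in the target sample."""
    n = len(Xs)
    pilot = nadaraya_watson(Xt, Xs, Ys, g)
    total = 0.0
    for _ in range(B):
        idx = rng.integers(0, n, n)
        X_star = Xs[idx] + g * rng.standard_normal(n)   # smooth bootstrap draw
        Y_star = Ys[idx] + g * rng.standard_normal(n)
        total += np.mean((nadaraya_watson(Xt, X_star, Y_star, h) - pilot) ** 2)
    return total / B

rng = np.random.default_rng(1)
Xs = rng.uniform(0, 3, 150)
Ys = np.sin(Xs) + rng.normal(0, 0.2, 150)
Xt = rng.beta(2, 4, 80) * 3                             # target covariates only

grid = np.linspace(0.05, 1.0, 20)
h_hat = grid[np.argmin([mase_star(h, Xs, Ys, Xt, g=0.25, B=50, rng=rng)
                        for h in grid])]
print(h_hat)
```

Because the criterion averages over the target covariates rather than the source ones, regions of the support that matter for the prediction receive the weight that the target design assigns to them.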