1. Scenario
Let us consider a sample of size $n$, $\{(X_1, Y_1), \ldots, (X_n, Y_n)\}$, drawn from a nonparametric regression model $Y_i = m(X_i) + \varepsilon_i$, $i = 1, \ldots, n$. We assume random design, $E(\varepsilon_i \mid X_i) = 0$ and $\mathrm{Var}(\varepsilon_i \mid X_i) = \sigma^2(X_i)$. In this context, we deal with the Nadaraya-Watson estimator [1] for the regression function, $m$, which is characterized by the kernel function $K$ and the bandwidth or smoothing parameter $h > 0$. Under suitable conditions, the asymptotically optimal (in the sense of minimum AMISE) bandwidth satisfies
$$h_{\mathrm{AMISE}} = C_0\, n^{-1/5}, \qquad (1)$$
where $C_0 > 0$ is an unknown constant depending on $m$, the design density and the conditional error variance.
Since we are assuming that the sample size, $n$, is very large, computing a bandwidth selector using the whole sample would be too computationally expensive. For example, the leave-one-out cross-validation (LOO CV) bandwidth selector has complexity $O(n^2)$.
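To make this quadratic cost concrete, the following is a minimal C++ sketch of the LOO CV criterion for the Nadaraya-Watson estimator. The Gaussian kernel and the plain bandwidth grid search are assumptions made here for illustration, and the function names (`nw_loo`, `cv_score`, `cv_bandwidth`) are ours, not the authors' implementation.

```cpp
// Minimal illustration (not the authors' code): LOO CV bandwidth selection
// for the Nadaraya-Watson estimator with a Gaussian kernel.
// Evaluating the CV criterion for a single bandwidth already costs O(n^2).
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

double gauss(double u) { return std::exp(-0.5 * u * u); }

// Nadaraya-Watson prediction at x, leaving out observation `skip`.
double nw_loo(const std::vector<double>& X, const std::vector<double>& Y,
              double x, double h, std::size_t skip) {
    double num = 0.0, den = 0.0;
    for (std::size_t j = 0; j < X.size(); ++j) {
        if (j == skip) continue;
        const double w = gauss((x - X[j]) / h);
        num += w * Y[j];
        den += w;
    }
    return den > 0.0 ? num / den : 0.0;
}

// LOO CV criterion: mean squared leave-one-out prediction error (O(n^2)).
double cv_score(const std::vector<double>& X, const std::vector<double>& Y, double h) {
    double sse = 0.0;
    for (std::size_t i = 0; i < X.size(); ++i) {
        const double e = Y[i] - nw_loo(X, Y, X[i], h, i);
        sse += e * e;
    }
    return sse / X.size();
}

// CV bandwidth: the grid value minimizing the criterion.
double cv_bandwidth(const std::vector<double>& X, const std::vector<double>& Y,
                    const std::vector<double>& grid) {
    double best_h = grid.front(), best = std::numeric_limits<double>::infinity();
    for (double h : grid) {
        const double s = cv_score(X, Y, h);
        if (s < best) { best = s; best_h = h; }
    }
    return best_h;
}
```

With a bandwidth grid of fixed size, the overall cost grows as $O(n^2)$, which is what motivates working on subsamples instead of the full dataset.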
2. Bandwidth Selection
The idea behind our proposal is to find the LOO CV bandwidth for several subsamples and then extrapolate the result to the original sample size using the asymptotic expression of the MISE bandwidth (1).
2.1. One Subsample Size (OSS)
The idea behind this method is to draw several subsamples of size $r$, much smaller than $n$, then compute the LOO CV selector and finally use Equation (1) to extrapolate the CV bandwidth for the original sample size (this idea was already proposed in [2] in the context of kernel density estimation to reduce the variance of the CV bandwidth selector).
- Obtain $s$ subsamples of size $r$, subsampling without replacement from our original dataset.
- For each subsample, find the LOO CV bandwidth.
- Let $\bar{h}_r$ denote the average of these bandwidths.
- We estimate the unknown constant $C_0$ in Equation (1) by $\hat{C}_0 = \bar{h}_r\, r^{1/5}$.
- Therefore, our estimate of the AMISE bandwidth would be $\hat{h}_n = \hat{C}_0\, n^{-1/5} = \bar{h}_r (r/n)^{1/5}$ (a code sketch of this procedure is given after the list).
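The sketch below illustrates the OSS steps above in C++, under the assumption that Equation (1) holds with the $n^{-1/5}$ rate and reusing the `cv_bandwidth` routine sketched in Section 1; parameter names and structure are ours, not the authors' implementation.

```cpp
// Illustrative sketch (not the authors' code) of the OSS selector:
// average the CV bandwidths of s subsamples of size r and rescale by (r/n)^(1/5).
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// LOO CV bandwidth routine as sketched in Section 1.
double cv_bandwidth(const std::vector<double>& X, const std::vector<double>& Y,
                    const std::vector<double>& grid);

double oss_bandwidth(const std::vector<double>& X, const std::vector<double>& Y,
                     std::size_t r, std::size_t s,
                     const std::vector<double>& grid, std::mt19937& rng) {
    const std::size_t n = X.size();
    std::vector<std::size_t> idx(n);
    for (std::size_t i = 0; i < n; ++i) idx[i] = i;

    double h_sum = 0.0;
    for (std::size_t b = 0; b < s; ++b) {
        // Subsample of size r drawn without replacement.
        std::shuffle(idx.begin(), idx.end(), rng);
        std::vector<double> Xr(r), Yr(r);
        for (std::size_t i = 0; i < r; ++i) { Xr[i] = X[idx[i]]; Yr[i] = Y[idx[i]]; }
        h_sum += cv_bandwidth(Xr, Yr, grid);
    }
    const double h_bar = h_sum / s;                        // average CV bandwidth at size r
    const double C0    = h_bar * std::pow(double(r), 0.2); // estimate of C_0 in (1)
    return C0 * std::pow(double(n), -0.2);                 // h_bar * (r/n)^(1/5)
}
```

Since each of the $s$ subsamples has size $r \ll n$, the dominant cost is $O(s\, r^2)$ rather than $O(n^2)$, and the $s$ CV problems can be solved in parallel.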
2.2. Several Subsample Sizes (SSS)
We now propose a method that considers several subsamples of different sizes.
- Consider a grid of subsample sizes, $r_1 < r_2 < \cdots < r_k$, with $r_k \ll n$.
- For each $r_j$, compute the LOO CV bandwidth, $\hat{h}_{r_j}$ (several subsamples of each size could be considered).
- Solve the ordinary least squares problem (or a robust analogue) given by $\min_{a, b} \sum_{j=1}^{k} \left[ \log \hat{h}_{r_j} - a - b \log r_j \right]^2$, in which case $\hat{C}_0 = e^{\hat{a}}$ and $\hat{b}$ is our estimate of the order of convergence of the AMISE bandwidth.
- Our estimate of the AMISE bandwidth for the original sample size, $n$, would be $\hat{h}_n = \hat{C}_0\, n^{\hat{b}}$ (see the sketch after this list).
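Below is a minimal C++ sketch of the SSS extrapolation step, assuming the log-log least squares formulation stated above; the LOO CV bandwidths for each subsample size are taken as already computed (e.g., with the `cv_bandwidth` routine from Section 1), and all names are illustrative rather than the authors' implementation.

```cpp
// Illustrative sketch (not the authors' code) of the SSS selector:
// fit log(h_r) = a + b*log(r) by ordinary least squares over a grid of
// subsample sizes, then extrapolate the bandwidth to the full sample size n.
#include <cmath>
#include <cstddef>
#include <vector>

struct LogLogFit { double a; double b; };   // intercept and slope of the fitted line

LogLogFit fit_log_log(const std::vector<double>& r, const std::vector<double>& h) {
    const std::size_t k = r.size();
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (std::size_t j = 0; j < k; ++j) {
        const double x = std::log(r[j]), y = std::log(h[j]);
        sx += x; sy += y; sxx += x * x; sxy += x * y;
    }
    const double b = (k * sxy - sx * sy) / (k * sxx - sx * sx);
    const double a = (sy - b * sx) / k;
    return {a, b};
}

// h_hat[j] is the (possibly averaged) LOO CV bandwidth for subsample size r[j].
double sss_bandwidth(const std::vector<double>& r, const std::vector<double>& h_hat,
                     std::size_t n) {
    const LogLogFit f = fit_log_log(r, h_hat);
    const double C0 = std::exp(f.a);          // estimate of C_0
    return C0 * std::pow(double(n), f.b);     // extrapolation to sample size n
}
```

Unlike OSS, which fixes the exponent at $-1/5$, this variant estimates the order of convergence from the data, at the price of solving CV problems for several subsample sizes.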
3. Simulation Study
Let us consider samples of size drawn from the model , where , and . Furthermore, we have considered a Gaussian kernel and, as a weight function, , where denotes the marginal quantile function of X.
It is clear from Figure 1 that the OSS selector outperforms the SSS selector in terms of statistical precision. Moreover, in many cases bandwidths that are quite distant from the optimum do not have an associated large error (in terms of AMISE). On the other hand, as we can observe in Table 1 and Table 2, the OSS selector is substantially faster than the SSS selector, since the former works with a single subsample size which, in turn, is even smaller than most of the sizes considered for the SSS selector. It should be noted that the source code for both selectors was written in C++ and run in parallel on an Intel Core i5-8600K 3.6 GHz processor.
Author Contributions
Conceptualization, D.B.-U., R.C. and M.F.-F.; Methodology, D.B.-U., R.C. and M.F.-F.; Software, D.B.-U., R.C. and M.F.-F.; Validation, D.B.-U., R.C. and M.F.-F.; Formal Analysis, D.B.-U., R.C. and M.F.-F.; Investigation, D.B.-U., R.C. and M.F.-F.; Resources, D.B.-U., R.C. and M.F.-F.; Data Curation, D.B.-U., R.C. and M.F.-F.; Writing—Original Draft Preparation, D.B.-U., R.C. and M.F.-F.; Writing—Review & Editing, D.B.-U., R.C. and M.F.-F.; Visualization, D.B.-U., R.C. and M.F.-F.; Supervision, D.B.-U., R.C. and M.F.-F.; Project Administration, D.B.-U., R.C. and M.F.-F.; Funding Acquisition, D.B.-U., R.C. and M.F.-F. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Acknowledgments
This research has been supported by MINECO grant MTM-2014-52876-R and by the Xunta de Galicia (Grupos de Referencia Competitiva ED431C-2016-015 and Centro Singular de Investigación de Galicia ED431G/01), all of them through the ERDF.
Conflicts of Interest
The authors declare no conflict of interest. The funding sponsors had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.
References
- Nadaraya, E.A. On estimating regression. Theory Probab. Its Appl. 1964, 9, 141–142.
- Wang, Q.; Lindsay, B.G. Improving cross-validated bandwidth selection using subsampling-extrapolation techniques. Comput. Stat. Data Anal. 2015, 89, 51–71.
© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).