Abstract
Functional data, which provide information about curves, surfaces or anything else varying over a continuum, have become a commonly encountered type of data. The k-nearest neighbor (kNN) method, as a nonparametric method, has become one of the most popular supervised machine learning algorithms for solving both classification and regression problems. This paper is devoted to the kNN estimator of the nonparametric functional regression model when the observed variables take values from negatively associated (NA) sequences. The uniform almost complete convergence rate of the proposed kNN estimator is first established. Then, numerical assessments, including a simulation study and a real data analysis, are conducted to evaluate the performance of the proposed method and compare it with the standard nonparametric kernel approach.
Keywords:
convergence rate; NA samples; functional data; nonparametric regression model; k-nearest neighbor estimator
MSC:
62G08; 62G20
1. Introduction
Functional data analysis (FDA) is a branch of statistics that analyzes data providing information about curves, surfaces or anything else varying over a continuum. In its most general form, under an FDA framework, each sample element of functional data is considered to be a random function.
Popularized by Ramsay and Silverman [1,2], functional data analysis has attracted considerable research interest because of its wide applications in many practical fields, such as medicine, economics and linguistics. For an introduction to the topic, we refer to the monographs of Ramsay and Silverman [3] for parametric models and Ferraty and Vieu [4] for nonparametric models.
In this paper, the following functional nonparametric regression model is considered:
$$Y = r(\chi) + \varepsilon, \qquad (1)$$
where $Y$ is a scalar response variable, $\chi$ is a covariate taking values in a subset $\mathcal{S}_{\mathcal{F}}$ of an infinite-dimensional functional space $\mathcal{F}$ endowed with a semi-metric $d(\cdot, \cdot)$, $r(\cdot)$ is the unknown regression operator from $\mathcal{F}$ to $\mathbb{R}$, and the random error $\varepsilon$ satisfies $E(\varepsilon \mid \chi) = 0$ almost surely.
For the estimation of model (1), Ferraty and Vieu [5] investigated the classical functional Nadaraya-Watson (N-W) kernel-type estimator of $r(\cdot)$ and obtained its asymptotic properties with rates in the case of $\alpha$-mixing functional data. Ling and Wu [6] studied a modified N-W kernel estimate and derived its asymptotic distribution for strong mixing functional time series data, and Baíllo and Grané [7] proposed a functional local linear estimate based on the local linear idea. In this paper, we focus on the k-nearest neighbors (kNN) method for regression model (1). The kNN method, one of the simplest and most traditional nonparametric techniques, is often used as a nonparametric classification method. It was first developed by Evelyn Fix and Joseph Hodges in 1951 [8] and then expanded by Thomas Cover [9]. In kNN regression, the input consists of the k closest training examples in a dataset, whereas the output is the property value for the object, namely the average of the values of its k nearest neighbors. Under independent samples, research in kNN regression mostly focuses on the estimation of the continuous regression function $r(\cdot)$. For example, Burba et al. [10] investigated the kNN estimator based on the idea of a locally adaptive bandwidth for functional explanatory variables. The papers [11,12,13,14,15,16,17,18], among others, obtained the asymptotic behavior of nonparametric regression estimators for functional data in independent and dependent cases. Further, Kudraszow and Vieu [19] obtained asymptotic results for a kNN generalized regression estimator when the observed variables take values in an abstract space, and Kara-Zaitri et al. [20] provided an asymptotic theory for several different target operators, including regression, conditional density, conditional distribution and hazard operators, together with some simulated experiments. However, functional observations often exhibit correlation, including some forms of negative dependence or negative association.
Negatively associated (NA) sequences were introduced by Joag-Dev and Proschan in [21]. Random variables $X_1, X_2, \ldots, X_n$, $n \ge 2$, are said to be NA if for every pair of disjoint subsets $A_1, A_2$ of $\{1, 2, \ldots, n\}$,
$$\mathrm{Cov}\bigl( f(X_i,\, i \in A_1),\; g(X_j,\, j \in A_2) \bigr) \le 0,$$
or equivalently,
$$E\bigl[ f(X_i,\, i \in A_1)\, g(X_j,\, j \in A_2) \bigr] \le E\bigl[ f(X_i,\, i \in A_1) \bigr]\, E\bigl[ g(X_j,\, j \in A_2) \bigr],$$
where $f$ and $g$ are coordinatewise non-decreasing functions such that this covariance exists. An infinite sequence $\{X_n,\, n \ge 1\}$ is NA if every finite subcollection is NA.
For example, if $(X_1, X_2, \ldots, X_n)$ follows a permutation distribution, that is, it takes the $n!$ permutations of $(x_1, x_2, \ldots, x_n)$ with equal probability $1/n!$, where $x_1, x_2, \ldots, x_n$ are $n$ real numbers, then $X_1, X_2, \ldots, X_n$ are NA.
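To make the definition concrete, the following minimal Python sketch (an illustration added here, not part of the original analysis) checks the covariance inequality empirically for a Gaussian vector with non-positive off-diagonal covariances, a classical NA example from Joag-Dev and Proschan [21]:

```python
import numpy as np

rng = np.random.default_rng(0)

# A Gaussian vector with non-positive off-diagonal covariances is NA
# (Joag-Dev and Proschan [21]); here every pair has correlation -0.2.
cov = np.array([[1.0, -0.2, -0.2],
                [-0.2, 1.0, -0.2],
                [-0.2, -0.2, 1.0]])
X = rng.multivariate_normal(mean=np.zeros(3), cov=cov, size=200_000)

# Disjoint index sets A1 = {1, 2} and A2 = {3}, with coordinatewise
# non-decreasing functions f and g.
f = X[:, 0] + np.maximum(X[:, 1], 0.0)   # non-decreasing in (X_1, X_2)
g = np.exp(X[:, 2])                      # non-decreasing in X_3

# NA requires Cov(f, g) <= 0; the Monte Carlo estimate should be
# negative up to sampling error.
print(np.cov(f, g)[0, 1])
```

Running this sketch produces a clearly negative covariance estimate, in line with the definition above.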
Since kNN regression under NA sequences has not yet been explored in the literature, in this paper we extend the kNN estimation of functional data from the case of independent samples to NA sequences.
Let $(\chi_i, Y_i)_{1 \le i \le n}$ be a sample of NA pairs valued in $\mathcal{F} \times \mathbb{R}$, where $(\mathcal{F}, d)$ is a semi-metric space; $\mathcal{F}$ is not necessarily of finite dimension, and we do not suppose the existence of a density for the functional random variable $\chi$. For a fixed $x \in \mathcal{F}$, the closed ball with $x$ as the center and $h$ as the radius is denoted as:
$$B(x, h) = \{ x' \in \mathcal{F} : d(x, x') \le h \}.$$
The kNN regression estimator [10] is defined as follows:
$$\hat{r}_{kNN}(x) = \frac{\sum_{i=1}^{n} Y_i\, K\bigl( d(x, \chi_i) / H_{n,k}(x) \bigr)}{\sum_{i=1}^{n} K\bigl( d(x, \chi_i) / H_{n,k}(x) \bigr)},$$
where $K$ is the kernel function supported on $[0, 1]$, and $H_{n,k}(x)$ is a positive random variable that depends on $(\chi_1, \ldots, \chi_n)$ and is defined by:
$$H_{n,k}(x) = \min\Bigl\{ h > 0 : \sum_{i=1}^{n} \mathbb{1}_{B(x, h)}(\chi_i) = k \Bigr\}.$$
Obviously, the kNN estimator can be seen as an extension, to a random locally adaptive neighborhood, of the traditional kernel estimator [5] defined as:
$$\hat{r}_n(x) = \frac{\sum_{i=1}^{n} Y_i\, K\bigl( d(x, \chi_i) / h_n \bigr)}{\sum_{i=1}^{n} K\bigl( d(x, \chi_i) / h_n \bigr)},$$
where $h_n$ is a sequence of positive real numbers such that $h_n \to 0$ a.s. as $n \to \infty$.
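To fix ideas, both estimators can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: curves are discretized on a common grid, the semi-metric is the discretized $L^2$ distance, and the function names are illustrative choices rather than the authors' code.

```python
import numpy as np

def l2_semimetric(x, curves, dt):
    """Discretized L2 distance d(x, chi_i) between a curve x and each
    sample curve, all evaluated on a common grid with mesh dt."""
    return np.sqrt(np.sum((curves - x) ** 2, axis=1) * dt)

def quadratic_kernel(u):
    """Quadratic (Epanechnikov-type) kernel supported on [0, 1]."""
    return np.where((u >= 0) & (u <= 1), 1.5 * (1.0 - u ** 2), 0.0)

def knn_regression(x, curves, y, k, dt):
    """kNN estimator: the bandwidth H_{n,k}(x) is the distance to the
    k-th nearest curve, so exactly k curves lie in the closed ball
    B(x, H_{n,k}(x)).  Assumes k >= 2 so some weight is positive."""
    d = l2_semimetric(x, curves, dt)
    h = np.sort(d)[k - 1]              # random, locally adaptive bandwidth
    w = quadratic_kernel(d / h)
    return np.sum(w * y) / np.sum(w)

def nw_regression(x, curves, y, h, dt):
    """Classical Nadaraya-Watson estimator with a fixed bandwidth h_n."""
    d = l2_semimetric(x, curves, dt)
    w = quadratic_kernel(d / h)
    return np.sum(w * y) / np.sum(w)
```

The only difference between the two functions is the bandwidth: fixed and deterministic for the kernel estimator, random and locally adaptive (through the k-th nearest-neighbor distance) for the kNN estimator.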
This paper is organized as follows. The main results on the asymptotic behavior of the kNN estimator with a data-driven random number of neighbors are given in Section 2. Section 3 illustrates the numerical performance of the proposed method, including a nonparametric functional regression analysis of the sea level surface temperature (SST) data for the El Niño area (0–10° S, 80–90° W). The technical proofs are postponed to Section 4. Finally, Section 5 is devoted to comments on the results and related perspectives for future work.
2. Assumptions and Main Results
In this section, we study the asymptotic properties of the kNN regression estimator. To state the convergence rate, we first recall the notion of almost complete convergence.
One says that the rate of almost complete convergence of a sequence $(Y_n)_{n \ge 1}$ to $Y$ is of order $u_n$ if and only if for any $\epsilon > 0$,
$$\sum_{n=1}^{\infty} P\bigl( |Y_n - Y| > \epsilon\, u_n \bigr) < \infty,$$
and we write $Y_n - Y = O_{a.co.}(u_n)$ (see, for instance, [5]). By the Borel-Cantelli lemma, this implies that $(Y_n - Y)/u_n \to 0$ almost surely, so almost complete convergence is stronger than almost sure convergence.
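For completeness, the Borel-Cantelli step can be spelled out as a short derivation (standard material, added here only for the reader's convenience):

```latex
\sum_{n \ge 1} P\bigl(|Y_n - Y| > \epsilon\, u_n\bigr) < \infty
\;\Longrightarrow\;
P\Bigl(\limsup_{n \to \infty} \bigl\{ |Y_n - Y| > \epsilon\, u_n \bigr\}\Bigr) = 0
\;\Longrightarrow\;
\limsup_{n \to \infty} \frac{|Y_n - Y|}{u_n} \le \epsilon \quad \text{a.s.}
```

Since $\epsilon > 0$ is arbitrary, intersecting these almost sure events over a countable sequence $\epsilon \downarrow 0$ gives $(Y_n - Y)/u_n \to 0$ almost surely.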
Our results are stated under some mild assumptions, which we gather below for easy reference. Throughout the paper, we denote by $C$ and $C'$ some positive generic constants, which may take different values at different places.
Assumption 1.
For any $x \in \mathcal{S}_{\mathcal{F}}$ and any $h > 0$, $P\bigl( \chi \in B(x, h) \bigr) =: \varphi_x(h) > 0$, and $\varphi_x(\cdot)$ is a continuous function, strictly monotonically increasing at the origin, with $\varphi_x(0) = 0$.
Assumption 2.
There exist a function $\phi(\cdot)$ and a bounded function $f(\cdot)$ such that:
- (i)
- $\phi(0) = 0$ and $\phi(h) > 0$ for $h > 0$, with $\phi(h) \to 0$ as $h \to 0$;
- (ii)
- $0 < C \le f(x) \le C' < \infty$ for any $x \in \mathcal{S}_{\mathcal{F}}$;
- (iii)
- for all $x \in \mathcal{S}_{\mathcal{F}}$ and all $h > 0$, $\varphi_x(h) = \phi(h)\, f(x)$, such that the small ball probability factorizes into a function of $h$ alone and a function of $x$ alone.
Assumption 3.
$K(\cdot)$ is a nonnegative bounded kernel function with support $[0, 1]$, and if $K(1) = 0$, the derivative $K'(\cdot)$ exists on $[0, 1]$, satisfying:
$$-\infty < C \le K'(t) \le C' < 0 \quad \text{for all } t \in [0, 1].$$
Assumption 4.
$r(\cdot)$ is a bounded Lipschitz operator of order β on $\mathcal{S}_{\mathcal{F}}$; that is, there exists $C > 0$ such that:
$$|r(x) - r(x')| \le C\, d(x, x')^{\beta} \quad \text{for all } x, x' \in \mathcal{S}_{\mathcal{F}}.$$
Assumption 5.
For all $m \ge 2$, $E\bigl( |Y|^m \mid \chi = x \bigr) = \sigma_m(x) < \infty$, with $\sigma_m(\cdot)$ continuous on $\mathcal{S}_{\mathcal{F}}$.
Assumption 6.
Kolmogorov’s ϵ-entropy of $\mathcal{S}_{\mathcal{F}}$ satisfies:
$$\sum_{n=1}^{\infty} \exp\Bigl\{ (1 - \tau)\, \psi_{\mathcal{S}_{\mathcal{F}}}\Bigl( \frac{\log n}{n} \Bigr) \Bigr\} < \infty \quad \text{for some } \tau > 1.$$
For $\epsilon > 0$, the Kolmogorov ϵ-entropy of a set $\mathcal{S} \subset \mathcal{F}$ is defined by $\psi_{\mathcal{S}}(\epsilon) = \log N_{\epsilon}(\mathcal{S})$, where $N_{\epsilon}(\mathcal{S})$ is the minimal number of open balls in $\mathcal{F}$, with centers in $\mathcal{F}$ and ϵ as the radius, which can cover $\mathcal{S}$.
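The entropy condition is purely theoretical, but for a finite set of discretized curves the covering number can be bounded numerically. The greedy scheme below is a rough illustrative sketch (the function name and the toy data are assumptions, not the paper's procedure):

```python
import numpy as np

def covering_number(curves, eps, dt):
    """Greedy upper bound on N_eps(S): the number of balls of radius
    eps (discretized L2 semi-metric) centered at data points that are
    needed to cover the whole set of curves."""
    remaining = list(range(len(curves)))
    n_balls = 0
    while remaining:
        center = curves[remaining[0]]
        d = np.sqrt(np.sum((curves[remaining] - center) ** 2, axis=1) * dt)
        remaining = [i for i, di in zip(remaining, d) if di >= eps]
        n_balls += 1
    return n_balls

# Toy usage: the entropy estimate is psi_S(eps) = log N_eps(S).
rng = np.random.default_rng(1)
t = np.linspace(0.0, 1.0, 100)
curves = rng.normal(size=(50, 1)) * np.sin(2 * np.pi * t)
print(np.log(covering_number(curves, eps=0.5, dt=t[1] - t[0])))
```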
Remark 1.
Assumption 1, Assumption 2((i)–(iii)) and Assumption 4 are the standard assumptions on small ball probabilities and regression operators in nonparametric FDA; see Kudraszow and Vieu [19]. Assumption 2(ii) plays a key role in the methodology, particularly in computing the asymptotic variance and making it explicit, as in Ling and Wu [6]. Assumption 2(iii) states that the small ball probability can be written as the product of the two independent functions $\phi(h)$ and $f(x)$, which has been used many times in Masry [11], Laib and Louani [12] and elsewhere in the literature. Assumption 5 is standard in the nonparametric setting and concerns the existence of the conditional moments, as in Masry [11] and Burba et al. [10]; it is needed to obtain the rate of uniform almost complete convergence. Assumption 6 is the Kolmogorov ϵ-entropy condition, which we will use in the proof of the rate of uniform almost complete convergence.
Theorem 1.
Under Assumptions 1–6, suppose that the sequence $k = k_n$ satisfies $k/n \to 0$, and for $n$ large enough,
$$\frac{(\log n)^2}{k} < \psi_{\mathcal{S}_{\mathcal{F}}}\Bigl( \frac{\log n}{n} \Bigr) < \frac{k}{\log n};$$
then we have:
$$\sup_{x \in \mathcal{S}_{\mathcal{F}}} \bigl| \hat{r}_{kNN}(x) - r(x) \bigr| = O\Bigl( \bigl[ \phi^{-1}(k/n) \bigr]^{\beta} \Bigr) + O_{a.co.}\Biggl( \sqrt{ \frac{\psi_{\mathcal{S}_{\mathcal{F}}}(\log n / n)}{k} } \Biggr).$$
Remark 2.
This theorem extends the kNN estimation result of Theorem 2 in Kudraszow and Vieu [19] from the independent case to the NA dependent case and obtains the same convergence rate under the stated assumptions. Moreover, the almost complete convergence rate of the prediction operator splits into two parts: one part is affected by the dependence and Kolmogorov's $\epsilon$-entropy, while the other depends on the smoothness of the regression operator and the smoothing parameter $k$.
Corollary 1.
Under the conditions of Theorem 1, we have:
Corollary 2.
Under the conditions of Theorem 1, we have:
3. Simulation
3.1. A Simulation Study
In this section, we aim to illustrate the performance of the nonparametric functional regression estimator and to compare it with the traditional kernel estimation method. We consider the nonparametric functional regression model:
where $\varepsilon_i$ is the random error term, and the functional curve $\chi_i(t)$ is generated in the following way:
where the coefficient vector follows a multivariate normal distribution $N(\mathbf{0}, \Sigma)$, $\mathbf{0}$ represents the zero vector, and the covariance matrix $\Sigma$ is defined as:
By the definition of NA, since the off-diagonal entries of $\Sigma$ are non-positive, it can be seen that the coefficient vector is an NA vector for each $i$, with finite moments of any order (see Wu and Wang [22]).
We choose the remaining constants casually; the sample sizes $n$ are as given in Table 1, and $t$ takes 1000 equispaced values. We carry out the simulation of the curves $\chi_i(t)$ for the $n = 330$ samples (see Figure 1).
Figure 1.
Sample of curves with sample size n = 330.
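A data-generating sketch in Python of the above design might look as follows; the basis functions, the regression operator and the noise level are stand-ins (the exact simulation constants are not recoverable from the text), but the negative off-diagonal covariance does make each coefficient vector NA, as noted above:

```python
import numpy as np

rng = np.random.default_rng(2022)
n, m = 330, 1000                      # sample size and grid resolution
t = np.linspace(0.0, 1.0, m)          # illustrative design interval

# Coefficient pairs from a bivariate normal with negative off-diagonal
# covariance; such Gaussian vectors are NA (see [21,22]).
cov = np.array([[1.0, -0.5],
                [-0.5, 1.0]])
coef = rng.multivariate_normal(np.zeros(2), cov, size=n)

# Illustrative NA functional covariates and noisy scalar responses.
curves = (coef[:, [0]] * np.cos(2 * np.pi * t)
          + coef[:, [1]] * np.sin(2 * np.pi * t))
y = np.mean(curves ** 2, axis=1) + rng.normal(0.0, 0.1, size=n)
```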
We consider the Epanechnikov kernel given by $K(u) = \frac{3}{2}(1 - u^2)\,\mathbb{1}_{[0,1]}(u)$, and the semi-metrics based on derivatives of order $q$:
$$d_q(x_1, x_2) = \sqrt{ \int \bigl( x_1^{(q)}(t) - x_2^{(q)}(t) \bigr)^2 \, dt }.$$
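For completeness, a crude finite-difference sketch of this derivative-based semi-metric is given below; in practice a smooth (e.g., B-spline) representation of the curves is used before differentiating, as discussed in Ferraty and Vieu [4]:

```python
import numpy as np

def deriv_semimetric(x1, x2, q, dt):
    """Semi-metric based on derivatives of order q:
    d_q(x1, x2) ~ sqrt( integral of (x1^(q) - x2^(q))^2 dt ),
    with the q-th derivative approximated by finite differences."""
    diff = x1 - x2
    for _ in range(q):
        diff = np.diff(diff) / dt      # crude q-th order differentiation
    return np.sqrt(np.sum(diff ** 2) * dt)
```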
Our purpose is to compare the mean square error (MSE) of the kNN method with that of the NW kernel approach on finite simulated datasets. The finite-sample simulation proceeds by the following steps.
Step 1: We take 300 curves to construct the training sample, and the other 30 curves constitute the test sample.
Step 2: Within the training sample, the parameter k in the kNN method and the bandwidth h in the NW kernel method are automatically selected by cross-validation, respectively (a sketch of this selection is given after Step 4 below).
Step 3: Based on the MSE criterion (see [4] for details), the semi-metric order q is selected for the kNN method and the NW method, respectively.
Step 4: The predicted responses of the test sample are calculated using the kNN method and the NW method, respectively; their MSEs and the scatter plots of predictions against the true values are presented in Figure 2.
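The cross-validation of Step 2 can be sketched as follows, reusing the knn_regression function from the sketch in Section 1; the grid of candidate values is an illustrative assumption:

```python
import numpy as np

def loocv_select_k(curves, y, k_grid, dt):
    """Leave-one-out cross-validation over a grid of neighbor counts k;
    knn_regression is the sketch from Section 1."""
    n = len(y)
    cv_err = np.zeros(len(k_grid))
    for j, k in enumerate(k_grid):
        for i in range(n):
            keep = np.arange(n) != i
            pred = knn_regression(curves[i], curves[keep], y[keep], k, dt)
            cv_err[j] += (y[i] - pred) ** 2 / n
    return k_grid[int(np.argmin(cv_err))]

# Example call: k_best = loocv_select_k(curves, y, range(2, 31), 1 / 1000)
```

The bandwidth h of the NW method is selected analogously, replacing the grid of neighbor counts by a grid of bandwidth values.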
Figure 2.
Prediction effects of the two estimation methods. (a) kNN estimation method. (b) NW estimation method.
As we can see in Figure 2, the MSE of the kNN method is much smaller than that of the NW method, and its scatter points are more densely distributed around the line $y = x$, which shows that the kNN method achieves a better fit and higher prediction accuracy for NA dependent functional samples.
The kNN method and the NW method were each used to conduct 100 independent replicated experiments at the different sample sizes. The AMSE was calculated for both methods at each sample size using the following equation:
$$\mathrm{AMSE} = \frac{1}{100} \sum_{m=1}^{100} \mathrm{MSE}_m,$$
where $\mathrm{MSE}_m$ denotes the mean square error over the test sample in the $m$-th replication.
As can be seen from Table 1, the AMSE of the kNN method is much smaller than that of the NW kernel method at each fixed sample size; when the estimation method is fixed, the AMSEs of the two estimation methods show the same trend: both decrease as the sample size increases. However, the decrease is significantly faster for the kNN method than for the NW kernel method.
Table 1.
The AMSE of the predicted response variables of the two methods under different sample sizes.
3.2. A Real Data Study
This section applies the proposed kNN regression to the analysis of data consisting of the sea level surface temperature (SST) for the El Niño area (0–10° S, 80–90° W) over a total of 31 years, from 1 January 1990 to 31 December 2020. The data are available online at https://www.cpc.ncep.noaa.gov/data/indices/ (accessed on 1 January 2022). More relevant discussions of these data can be found in Ezzahrioui et al. [13,14], Delsol [23], and Ferraty et al. [24]. The 1618 weekly SST records from the original data were preprocessed and averaged by month to obtain 372 monthly average SST values. Figure 3 displays the multiplicative time series decomposition of the monthly SST.
Figure 3.
Monthly mean SST factor decomposition fitting comprehensive output diagram.
Figure 4 shows that the monthly average SST in the El Niño region from 1990 to 2020 had a clear seasonal variation, and the monthly trend of the SST can also be clearly observed from the seasonal index plot of the monthly mean SST.
Figure 4.
Time series curve of the SST in the El Niño area during 31 years.
The main factors affecting the temperature variation can generally be summarized as seasonal factors and random fluctuations. If the seasonal factor is removed, the SST is left with only random fluctuations, i.e., the values fluctuate up and down around some mean value. Conversely, if the effect of random fluctuations is not considered, the SST is left with only the seasonal factor, i.e., the SST takes similar values in the same month of different years.
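The decomposition in Figure 3 is a standard multiplicative time series decomposition; a sketch of how such a decomposition can be produced is given below, assuming the 372 monthly means are held in a pandas Series (the placeholder values only make the sketch runnable; the real series comes from the NOAA file cited above):

```python
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# 372 monthly mean SST values, January 1990 to December 2020.
idx = pd.date_range("1990-01-01", periods=372, freq="MS")
sst = pd.Series(24.0, index=idx)      # placeholder values

# Multiplicative decomposition into trend, seasonal and residual
# components, the same type of decomposition displayed in Figure 3.
result = seasonal_decompose(sst, model="multiplicative", period=12)
result.plot()   # panels: observed, trend, seasonal, residual
```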
The following steps implement the kNN regression estimation method for the analysis of the SST data; the comparison with the NW estimation method is displayed in Figure 5.
Figure 5.
Forecast values of the SST in 2020 by the kNN method and the NW method.
Step 1: Transform 372 months (31 years) of SST data into functional data.
Step 2: Divide the 31 samples of data into two parts: 30 training samples of data for model fitting and 1 test sample of data for prediction assessment.
Step 3: Here, a semi-metric based on functional principal component analysis (FPCA) is used, which is suitable for rough curves such as the SST data (see Chapter 3 of Ferraty and Vieu [4] for the methodology). The quadratic kernel function of Section 3.1 is used in the kNN regression.
Step 4: The SST values for 12 months in 2020 are predicted by the kNN method and the NW method, respectively, along with obtaining their MSEs for both methods.
Then, in Step 1, we split the discrete monthly average temperature data of 372 months into 31 yearly temperature curves $\chi_i$, $i = 1, \ldots, 31$. The response variable for each curve is constructed from the following year's SST, so that the pairs $(\chi_i, Y_i)$, $i = 1, \ldots, 30$, form a dependent functional sample of size 30, where $\chi_i$ is functional data and $Y_i$ is a real value.
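In code, Step 1 amounts to a simple reshaping; the file name and the exact pairing of curves with responses below are assumptions made only for illustration:

```python
import numpy as np

# The 372 preprocessed monthly means (hypothetical file name).
sst_monthly = np.loadtxt("sst_monthly.txt")

# Step 1: reshape into the 31 yearly curves chi_1, ..., chi_31.
yearly_curves = sst_monthly.reshape(31, 12)

# Pairing (an assumption): the curve of year i is the functional
# covariate and a monthly SST value of year i + 1 is the scalar
# response, giving a dependent functional sample of size 30.
X_train = yearly_curves[:30]          # chi_1, ..., chi_30
y_train = yearly_curves[1:, 0]        # e.g., January of the next year
```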
In Step 3, the choice of the parameter q for the kNN method and the NW method is performed via cross-validation in R. The selection of the parameters k and h is similar to Section 3.1.
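The FPCA-based semi-metric of Step 3 projects each curve on the first q empirical principal components and measures the Euclidean distance between the score vectors; a minimal sketch under these assumptions:

```python
import numpy as np

def fpca_scores(curves, q):
    """Scores of discretized curves on the first q empirical principal
    components, computed via an SVD of the centered data matrix."""
    centered = curves - curves.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:q].T

def fpca_semimetric(s1, s2):
    """FPCA semi-metric: Euclidean distance between score vectors."""
    return np.sqrt(np.sum((s1 - s2) ** 2, axis=-1))

# Usage sketch (q = 4 is illustrative):
# scores = fpca_scores(yearly_curves, q=4)
# d_ij = fpca_semimetric(scores[i], scores[j])
```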
From Figure 5, which compares the MSE values calculated by the two methods, it can be seen that the MSE of the kNN method is much smaller than that of the NW method. Further, regarding the degree of fit between the fitted curves and the true curve (dotted line), the curves predicted by the two methods are both generally close to the true curve, indicating that the prediction effect of both methods is very good. However, a closer look reveals that the predictions of the kNN method fit obviously better at the inflection points of the curve, such as January, February, March, November and December, which reflects the fact that the kNN method pays more attention to local variation than the NW method when processing data like these, including abnormal or extreme distributions of the response variable.
4. Proof of Theorem
In order to prove the main results, we give some lemmas. Let be n random pairs valued in , where is a general measurable space. Let be a fixed subset of , be a measurable function, for ,
is a sequence of random real variables (r.r.v.), and is a nonrandom function such that . For , we define:
Lemma 1
([10]). Let be a decreasing positive real sequence satisfying . For any increasing sequences and , there exist two real random sequences and such that:
- (L1)
- ,
- (L2)
- a.co.
- (L3)
- (L4)
- (L5)
then, we have:
The proof of Lemma 1 is not presented here because it follows, step by step, the same arguments as in Burba et al. [10] and Kudraszow and Vieu [19].
Lemma 2
([26]). Let be an NA random sequence with zero mean, and there exists a positive constant such that , let . For any , we get:
and
Lemma 3.
Suppose that Assumptions 1–6 hold, and a.s. in model (3) satisfying:
and for n large enough,
then we have:
where
Proof of Lemma 3.
In order to simplify the proof, we first introduce some notation. Let the mixed operator covariance be denoted as follows,
where ,
For the fixed in model (3), we have the decomposition as follows:
where:
In order to establish (9), it suffices to prove the following three results.
We first show Equation (11). In fact, we have the following decomposition:
For , by Assumption 3, it is easily seen that:
thus,
for , we have:
According to (4) in Lemma 2 and Assumption 6, we have:
Hence, it follows that:
For the next term, similarly to the preceding proof, we have:
Thus,
Finally, for the last term, the proof process is similar, and we can obtain:
Therefore, combining Equations (13)–(15), Equation (11) is established.
Similarly, we may prove Equation (12). Hence, the proof of Lemma 3 is completed. □
Proof of Theorem 1.
According to Lemma 1, let , , , , , , . Let be an increasing sequence such that , where is a decreasing positive real sequence such that and . Let and be two real random sequences such that:
Firstly, we verify the corresponding conditions in Lemma 1. It follows easily that the local bandwidth satisfies condition (5); combining this with Assumption 2, it follows that condition (6) is satisfied. From Assumption 2(i), we obtain that the required bound is satisfied. Hence, according to the conditions of the Theorem, Equations (7) and (8) in Lemma 3 hold. Thus, by Lemma 3, we have:
Similarly, for , we can also get:
Secondly, we check the next conditions in Lemma 1; combining (16) and (17), it clearly follows that:
By Assumption 1 we get:
According to (5) and (18), for , we have:
That is:
Therefore, by Assumption 1 we can get:
Thus,
is checked.
Finally, we establish the remaining condition in Lemma 1. Similarly to Kudraszow and Vieu [19], we denote:
and let:
Then, can be decomposed as follows:
By Assumption 3, it follows that:
and for the remaining term, referring to Ferraty et al. [25], we have:
Therefore,
Moreover, according to Lemma 1 in Ezzahrioui and Ould-Saïd [13] and Assumption 2(iii), there exists a constant such that,
by , holds. Hence, for , it follows that
Combining (19)–(22), we obtain:
is established.
Thus, conditions (L1)–(L5) in Lemma 1 have all been established. By Lemma 1, we can get:
The proof of Theorem 1 is completed. □
5. Conclusions and Future Research
Functional data analysis deals with the analysis and theory of data that come in the form of functions, images, shapes or more general objects. In a way, correlation really lies at the heart of data science. The dependence between variables may be complicated, ranging from simple independence to $\alpha$-mixing or other structures, such as negative association (NA). The kNN method, as one of the nonparametric methods, is very useful in statistical estimation and machine learning. While regression analysis of functional data has been explored under many dependence structures, the NA case had not been addressed. This paper builds a kNN regression estimator of the functional regression model under NA sequences. In particular, we obtain the almost complete convergence rate of the kNN estimator. Simulated experiments and a real data analysis illustrate the feasibility and the finite-sample behavior of the method. Further work includes introducing the kNN machine learning algorithm into broader functional data analysis and kNN high-dimensional modeling with NA sequences.
Author Contributions
Conceptualization, X.H. and J.W.; methodology, X.H.; software, J.W.; writing—original draft preparation, X.H. and J.W.; writing—review and editing, K.Y. and L.W.; visualization, K.Y.; supervision, X.H.; project administration, K.Y.; funding acquisition, K.Y. All authors have read and agreed to the published version of the manuscript.
Funding
This research was funded by the National Social Science Foundation (Grant No. 21BTJ040).
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
https://www.cpc.ncep.noaa.gov/data/indices/ (accessed on 9 January 2022).
Acknowledgments
The authors are most grateful to the Editor and the anonymous referee for carefully reading the manuscript and for valuable suggestions which helped in improving an earlier version of this paper.
Conflicts of Interest
The authors declare no conflict of interest in this paper.
Abbreviations
The following abbreviations are used in this manuscript:
| NA | Negatively Associated |
| kNN | k-Nearest Neighbor |
References
- Ramsay, J.; Dalzell, C. Some Tools for Functional Data Analysis. J. R. Stat. Soc. Ser. B Methodol. 1991, 53, 539–561. [Google Scholar] [CrossRef]
- Ramsay, J.; Silverman, B. Functional Data Analysis; Springer: New York, NY, USA, 1997. [Google Scholar]
- Ramsay, J.; Silverman, B. Functional Data Analysis, 2nd ed.; Springer: New York, NY, USA, 2005. [Google Scholar]
- Ferraty, F.; Vieu, P. Nonparametric Functional Data Analysis; Springer: New York, NY, USA, 2006. [Google Scholar]
- Ferraty, F.; Vieu, P. Nonparametric Models for Functional Data, with Application in Regression, Time Series Prediction and Curve Discrimination. J. Nonparametr. Stat. 2004, 16, 111–125. [Google Scholar] [CrossRef]
- Ling, N.X.; Wu, Y.H. Consistency of Modified Kernel Regression Estimation with Functional Data. Statistics 2012, 46, 149–158. [Google Scholar] [CrossRef]
- Baíllo, A.; Grané, A. Local Linear Regression for Functional Predictor and Scalar Response. J. Multivar. Anal. 2009, 100, 102–111. [Google Scholar] [CrossRef]
- Fix, E.; Hodges, J. Discriminatory Analysis. Nonparametric Discrimination: Consistency Properties. Int. Stat. Rev. 1989, 57, 238–247. [Google Scholar] [CrossRef]
- Altman, N.S. An introduction to kernel and nearest-neighbor nonparametric regression. Am. Stat. 1992, 46, 175–185. [Google Scholar]
- Burba, F.; Ferraty, F.; Vieu, P. k-Nearest Neighbour Method in Functional Nonparametric Regression. J. Nonparametr. Stat. 2009, 21, 453–469. [Google Scholar] [CrossRef]
- Masry, E. Nonparametric Regression Estimation for Dependent Functional Data: Asymptotic Normality. Stoch. Process. Appl. 2005, 115, 155–177. [Google Scholar] [CrossRef]
- Laib, N.; Louani, D. Rates of strong consistencies of the regression function estimator for functional stationary ergodic data. J. Stat. Plan. Inference 2011, 141, 359–372. [Google Scholar] [CrossRef]
- Ezzahrioui, M.; Ould-Saïd, E. Asymptotic Normality of a Nonparametric Estimator of the Conditional Mode Function for Functional Data. J. Nonparametr. Stat. 2008, 20, 3–18. [Google Scholar] [CrossRef]
- Ezzahrioui, M.; Ould-Saïd, E. Some Asymptotic Results of a Nonparametric Conditional Mode Estimator for Functional Time-Series Data. Stat. Neerl. 2010, 64, 171–201. [Google Scholar] [CrossRef]
- Horvath, L.; Kokoszka, P. Inference for Functional Data with Applications; Springer: New York, NY, USA, 2012. [Google Scholar]
- Ling, N.X.; Wang, C.; Ling, J. Modified Kernel Regression Estimation with Functional Time Series data. Stat. Probab. Lett. 2016, 114, 78–85. [Google Scholar] [CrossRef]
- Abdelmalek, G.; Abdelhak, C. Strong uniform consistency rates of the local linear estimation of the conditional hazard estimator for functional data. Int. J. Appl. Math. Stat. 2020, 59, 1–13. [Google Scholar]
- Mustapha, M.; Salim, B.; Ali, L. The consistency and asymptotic normality of the kernel type expectile regression estimator for functional data. J. Multivar. Anal. 2021, 181. [Google Scholar] [CrossRef]
- Kudraszow, N.L.; Vieu, P. Uniform Consistency of kNN Regressors for Functional Variables. Stat. Probab. Lett. 2013, 83, 1863–1870. [Google Scholar] [CrossRef]
- Kara-Zaitri, L.; Laksaci, A.; Rachdi, M.; Vieu, P. Data-driven kNN Estimation in Nonparametric Functional Data Analysis. J. Multivar. Anal. 2017, 153, 176–188. [Google Scholar] [CrossRef]
- Joag-Dev, K.; Proschan, F. Negative Association of Random Variables with Application. Ann. Stat. 1983, 11, 286–295. [Google Scholar] [CrossRef]
- Wu, Y.; Wang, X.; Sung, S.H. Complete Moment Convergence for Arrays of Rowwise Negatively Associated Random Variables and its Application in Non-parametric Regression Model. Probab. Eng. Inf. Sci. 2017, 32, 37–57. [Google Scholar]
- Delsol, L. Advances on Asymptotic Normality in Nonparametric Functional Time Series Analysis. Statistics 2009, 43, 13–33. [Google Scholar] [CrossRef]
- Ferraty, F.; Rabhi, A.; Vieu, P. Conditional Quantiles for Dependent Functional Data with Application to the Climatic El Niño Phenomenon. Sankhyā Indian J. Stat. 2005, 67, 378–398. [Google Scholar]
- Ferraty, F.; Laksaci, A.; Tadj, A.; Vieu, P. Rate of Uniform Consistency for Nonparametric Estimates with Functional Variables. J. Stat. Plan. Inference 2010, 140, 335–352. [Google Scholar] [CrossRef]
- Christofides, T.C.; Hadjikyriakou, M. Exponential Inequalities for N-demimartingales and Negatively Associated Random Variables. Stat. Probab. Lett. 2009, 79, 2060–2065. [Google Scholar] [CrossRef]
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).