1. Introduction
In this paper, we consider a nonparametric regression model, where bivariate observations
satisfy the following equations:
where
, is an unknown random function (process) which is almost surely continuous, the design
consists of a set of observable random variables with possibly unknown distributions lying in
, and the design points are not necessarily independent or identically distributed. We will consider the design as a triangular array, i.e., the random variables
may depend on
n. In particular, this scheme includes regression models with fixed design. The random regression function
is not supposed to be design-independent. Below, we impose some fairly standard regression-analysis conditions on the random errors . In particular, they are assumed to be centered but not necessarily independent or identically distributed.
The paper is devoted to constructing uniformly consistent estimators for the regression function under minimal assumptions on the correlation of design points.
The most popular kernel estimation procedures in the classical case of a nonrandom regression function are apparently the Nadaraya–Watson, Priestley–Chao, and Gasser–Müller estimators, local polynomial estimators, and their modifications (e.g., see [1,2,3,4,5]). We are primarily interested in the dependence conditions imposed on the design elements . In this regard, the huge number of publications in the field of nonparametric regression can be conditionally divided into two groups: the first contains papers with a random design, and the second contains papers with a fixed design.
In the papers dealing with a random design, the observations are either independent and identically distributed or, as a rule, form stationary sequences satisfying one or another known form of dependence. In particular, various types of mixing conditions, schemes of moving averages, associated random variables, Markov or martingale properties, and so on have been used. In this regard, we note, for example, the papers [3,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22]. In the recent papers [23,24,25,26], nonstationary sequences of design elements with one or another special type of dependence are considered (Markov chains, autoregression, partial sums of moving averages, etc.). In the case of a fixed design, the overwhelming majority of works assume certain regularity conditions on the design (e.g., see [9,10,27,28,29,30,31,32,33]). So, the nonrandom design points
are most often given by the formula
with some function
g of bounded variation, where the error
is uniform in all
. If
g is linear then we obtain a so-called
equidistant design. Another version of the regularity condition is the relation
(here it is assumed that the design elements are arranged in increasing order).
The problem of uniform approximation of a regression function has been studied by many authors (e.g., see [7,9,10,14,15,17,20,22,26,30,34,35,36] and the references therein).
In connection with studying the random regression function , we note, for example, the papers [37,38,39,40,41,42,43,44,45,46], where the mean and covariance functions of the random regression function f are estimated in the case when, for N independent copies of the function f, noisy values of each of these trajectories are observed for some collection of design elements (the design can be either common to all trajectories or different from series to series). Estimation of the mean and covariance functions is an actively developing area of nonparametric estimation, especially in the last couple of decades; it is both of independent interest and plays an important role in subsequent analysis of the random process f (e.g., see [39,40,45,47,48,49]). We consider one variant of this problem as an application of the main result.
The purpose of this article is to construct estimators that are uniformly consistent (in the sense of convergence in probability) not only in the cases of dependence reviewed above, but also under significantly different dependence structures of the observations, when neither ergodicity or stationarity nor the classical mixing conditions and other well-known dependence restrictions are satisfied. Note that the proposed estimators belong to the class of local linear kernel estimators, but with somewhat different weights than in the classical version. Namely, instead of the original observations, we consider their concomitants associated with the variational series based on the design observations, and the spacings of this variational series are taken as additional weights in the corresponding weighted least-squares method generating the above-mentioned new estimators. It is important to emphasize that these estimators are universal with respect to the nature of dependence of the observations: the design can be either fixed and not necessarily regular, or random, while not necessarily satisfying the traditional dependence conditions. In particular, the only condition on the design points that guarantees the uniform consistency of the new estimators is the dense filling of the domain of definition of the regression function. In our opinion, this condition is very natural and, in fact, necessary for reconstructing the function on the domain covered by the design elements. Previously, similar ideas were implemented in [50] for slightly different estimators (for details, see Section 4). Similar conditions on the design elements were also used in [51,52] in nonparametric regression, and in [53,54,55] in nonlinear regression.
The paper has the following structure. Section 2 contains the main results. Section 3 discusses the problem of estimating the mean function of a stochastic process. A comparison of the universal local linear estimators with some known ones is given in Section 4. Section 5 contains some results of computer simulation. In Section 6, we compare the results of using the new universal local linear estimators with the most common approaches of data analysis based on the epidemiological research ESSE-RF. In Section 7, we briefly summarize the results of the study. The proofs of the results from Section 2, Section 3, and Section 4 are deferred to Section 8.
2. Main Results
We need a number of assumptions.
The observations are represented in the form (1), where the unknown random regression function , is almost surely continuous. The design points are a set of observable random variables with values in , having, generally speaking, unknown distributions and not necessarily independent or identically distributed. Moreover, the random variables may depend on n, i.e., they can be considered as an array of design observations. The random function may be design-dependent. For all , the unobservable random errors satisfy with probability 1 the following conditions for all and :
where the constant may be unknown and does not depend on n, and the symbol stands for the conditional expectation given the σ-field generated both by the paths of the random process and by the random variables .
A kernel , , is equal to zero outside the interval and is the density of a symmetric distribution with the support in , i.e., , for all , and . We assume that the function satisfies the Lipschitz condition with constant and .
In what follows, we denote by
,
, the absolute
jth moment of the distribution with density
, i.e.,
. Put
. It is clear that
is a probability density with support lying in
. We need also the notation
Remark 1. We emphasize that assumption includes the fixed-design situation. We consider the segment as the domain of the design solely for the sake of simplicity of exposition of the approach. In the general case, instead of the segment , one can consider an arbitrary Jordan measurable subset of .
Further, we denote by
the order statistics constructed by the sample
. Put
For every
i, the response variable and the random error from (
1) associated with the order statistic
will be denoted by
and
, respectively. It is easy to see that the new errors
satisfy condition
as well. Next, by
we denote a random variable
such that, for all
, one has
where
and
are positive (possibly random) variables and the function
that may depend on the kernel
K and
. We agree that, throughout what follows, all limits, unless otherwise stated, are taken for
.
Let us introduce one more constraint, which is the crucial condition of the paper (in particular, the only condition on design points that guarantees the existence of a uniformly consistent estimator; see also the comments at the end of the section).
The following limit relation holds: .
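The exact formulation of this condition involves the maximum spacing of the ordered design points tending to zero (cf. Remark 8 below and the discussion in the Conclusions). As a rough numerical illustration of this dense-filling behavior, the following R sketch (an illustrative check under that assumption, not part of the estimation procedure) computes the maximum spacing, including the boundary gaps, for an i.i.d. uniform design:

```r
# Illustrative check of the dense-filling condition: the maximum spacing of
# the ordered design points on [0, 1] (boundary gaps included) should vanish
# as n grows. The i.i.d. uniform design here is for illustration only.
max_spacing <- function(x) {
  xs <- sort(x)
  max(diff(c(0, xs, 1)))
}

set.seed(1)
sapply(c(100, 1000, 10000), function(n) max_spacing(runif(n)))
```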
Finally, for any
, we introduce into consideration the following class of estimators for the regression function
f:
where
is the indicator function,
hereinafter, we use the notation
Remark 2. It is easy to see that the difference is the variance of a non-degenerate distribution; thus, it is strictly positive.
Remark 3. It is easy to verify that kernel estimator (3), without the indicator factor, is the first coordinate of the two-dimensional estimate of the weighted least-squares method, i.e., of the two-dimensional point at which the following minimum is attained: Thus, the proposed class of estimators is, in a certain sense (in fact, by construction), close to the classical local linear kernel estimators, but in the weighted least-squares method (5) we use slightly different weights.
Remark 4. In the case when there are multiple design points, some spacings vanish, and we lose part of the sample information in the estimator (3). In this case, it is proposed, before using the estimator (3), to slightly reduce the sample by replacing the observations at coinciding design points with their sample mean and keeping only one of the coinciding design points in the new sample. The averaged observations have less noise, so, despite the smaller size of the new sample, we do not lose the information contained in the original sample.
Let us further agree to denote by , , absolute positive constants, and by , positive constants depending only on the kernel K.
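For concreteness, here is a minimal R sketch of such a spacing-weighted least-squares fit at a single point t. It is illustrative only, not the authors' reference implementation: the weight of each ordered observation is assumed to be the kernel value times the spacing of the variational series (as described in the Introduction), and the convention for the first spacing is an assumption of this sketch.

```r
# Sketch of the weighted least-squares construction behind (3) and (5).
# Assumption: weight = kernel value times spacing of the ordered design;
# the first spacing is taken from the left endpoint 0 (illustrative choice).
ull_at_point <- function(t, x, y, h, kern) {
  ord <- order(x)
  xs  <- x[ord]
  ys  <- y[ord]
  dx  <- c(xs[1], diff(xs))              # spacings of the ordered design
  w   <- kern((xs - t) / h) * dx         # kernel weight times spacing
  keep <- w > 0
  if (sum(keep) < 2) return(NA_real_)
  # weighted LS in (a, b): minimize sum of w * (y - a - b * (x - t))^2
  fit <- lm.wfit(cbind(1, xs[keep] - t), ys[keep], w[keep])
  unname(fit$coefficients[1])            # the intercept estimates f(t)
}
```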
The main result of this section is as follows.
Theorem 1. Let conditions , , and be satisfied. Then, for any fixed , with probability 1 the following bound is satisfied:
where and the random variable meets the relation
with the constant from (4).
Remark 5. As follows from the proof of Theorem 1, the constants and have the following structure:
Remark 6. Since , under condition the limit relation holds. Therefore, taking into account Theorem 1, we can assert that . Thus, the bandwidth h can be determined, for example, by the relation
It is easy to see that, when is satisfied, the limit relations and hold. In fact, the value of equalizes in h the order of smallness in probability of both terms on the right-hand side of the relation (6). Note also that, for nonrandom f, one can choose as a solution to the equation
It is clear that this solution tends to zero as n grows.
The relations (8) and (9) allow us to obtain the order of smallness of the optimal bandwidth h, but not the optimal value of h. In practice, h can be chosen, for example, by cross-validation. From Theorem 1 and Remark 6 it is easy to obtain the following corollary.
Corollary 1. Let the conditions , , , and be satisfied, the regression function be nonrandom, and be an arbitrary subset of equicontinuous functions in (for example, a precompact set). Then
where is defined by equation (9), in which the modulus of continuity is replaced with the universal modulus . Moreover, the asymptotic relation holds.
Remark 7. It is easy to see that, for a nonrandom , the modulus of continuity in (9) can be replaced by one or another upper bound for , obtaining the corresponding upper bound for . Consider the case . If consists of functions satisfying the Hölder condition with exponent and a universal constant then and . In particular, if the functions from satisfy the Lipschitz condition () with a universal constant then . From Theorem 1 and Remark 6 we obtain the following corollary.
Corollary 2. Let the conditions , , , and be satisfied and let the modulus of continuity of the random regression function with probability 1
admit the upper bound , where is a random variable and is a positive continuous nonrandom function such that as . Then
where the value is defined in (9) after the replacement .
Let us discuss in more detail condition
. Obviously, condition
is satisfied for any nonrandom regular design (this is the case of nonidentically distributed
depending on
n). If
are independent and identically distributed and the interval
is the support of distribution of
, then condition
is also satisfied. In particular, if the distribution density of
is separated from zero on
, then
holds (see details in [
50]). If
is a stationary sequence with a marginal distribution with the support
, satisfying an
-mixing condition, then condition
is also satisfied (see Remark 8 below). Note that the dependence of the random variables
satisfying condition
can be much stronger, which is illustrated in the following example.
Example 1. Let the sequence of random variables be defined by the relation
where and are independent and uniformly distributed on and , respectively, the sequence does not depend on , and consists of Bernoulli random variables with success probability , i.e., the distribution of random variables is an equilibrium mixture of two uniform distributions on the corresponding intervals. The dependence between the random variables for any natural number i is defined by the equalities and . In this case, the random variables in (11) form a stationary sequence of random variables uniformly distributed on the segment , satisfying condition . On the other hand, for all natural numbers m and n,
Thus, all the known conditions for the weak dependence of random variables (in particular, the mixing conditions) are not satisfied here. According to the scheme of this example, it is possible to construct various sequences of dependent random variables uniformly distributed on by choosing sequences of Bernoulli switches with the conditions and for infinite numbers of indices and . In this case, condition will also be satisfied, but the corresponding sequence (not necessarily stationary) may not even satisfy the strong law of large numbers. For example, this is the case when for , and for , where (i.e., we randomly choose one of the two segments and , into which we randomly throw the first point, and then alternate the selection of one of the two segments by the following numbers of elements of the sequence: 1, 2, , , etc.). Indeed, we can introduce the notation , , and note that, for all elementary events from the event , one has
where and are the sets of indices for which the observations lie in the intervals or , respectively. It is easy to see that and . Hence, almost surely as due to the strong law of large numbers for the sequences and . On the other hand, as , for all elementary events from one has
where and are the sets of indices for which the observations lie in the intervals or , respectively. Proving the convergence in (12), we took into account that and , i.e., . Similar arguments are valid for all elementary events from .
Remark 8. In the case of i.i.d. random variables , condition will be fulfilled if, for all ,
where the supremum is taken over all intervals of length δ. Indeed, for any natural , we divide the interval into N subintervals , , of length . Then one has
since the event implies the existence of an interval of length that does not contain any points from the collection . Thereby, condition (13) implies the limit relation , which is equivalent to convergence with probability 1 due to the monotonicity of the sequence . In particular, if are independent then and , i.e., as , the finite collection with probability 1 forms a refining partition of the segment . It is easy to show that if is a stationary sequence satisfying an α-mixing condition and having a marginal distribution with the support , then (13) will be valid.
4. Comparison with Some Known Approaches
In [
50], under the conditions of the present paper, the following estimators were studied:
It is interesting to compare the new estimators
with the estimators
from [
50] as well as with other estimators (for example, the Nadaraya–Watson estimators
and classical local linear estimators
). Throughout this section, we assume that conditions
,
, and
are satisfied and the regression function
is nonrandom. Moreover, we need the following constraint.
The regression function in Model (1) is twice continuously differentiable, the errors are independent, identically distributed, centered, and independent of the design , whose elements are independent and identically distributed. In addition, the distribution function of the random variable has a strictly positive density continuously differentiable on . Such severe restrictions on the parameters of the regression model are explained both by difficulties in calculating the asymptotic representation for the variances of the estimators and , and by the properties of the Nadaraya–Watson estimators, which are very sensitive to the nature of the dependence of the design elements.
For any statistical estimator
of the regression function
, we will use the notation
for its bias, i.e.,
Put
and for
, introduce the notation
The following asymptotic representation for the bias and variance of the estimator
was obtained in [
50].
Proposition 1. Let condition be fulfilled and . If and so that , , and then, for any , the following asymptotic relations are valid: Note that the first statement concerning the asymptotic behavior of the bias in Proposition 1 was actually proved for arbitrarily dependent design elements when condition is met. The following two propositions and corollaries are also obtained without any assumptions about correlation of design elements, only conditional centering and conditional orthogonality of the errors from condition are used.
Proposition 2. Let . Then, for any fixed ,
where
Proposition 3. Let the regression function be twice continuously differentiable. Then, for any fixed ,
where Moreover,
besides, the error terms and in (22) and (24) are uniform in t.
Corollary 3. Let the regression function be twice continuously differentiable, , and . Then, for each fixed such that , the following asymptotic relations are valid:
Corollary 4. Suppose that, under the conditions of the previous corollary, f has nonzero first and second derivatives in a neighborhood of zero. Then, for any fixed positive such that , the following asymptotic relations hold:
where
Note that, due to the Cauchy–Bunyakovsky inequality and the properties of the density , the strict inequality holds for any .
Remark 11. Similar relations take place in a neighborhood of the right boundary of the segment , when for any . In this case, in the above asymptotics, one simply needs to replace the right-hand derivatives at zero by the analogous (nonzero) left-hand derivatives at the point 1, and the quantities must be replaced by . In this case, the coefficient will not change, and the corresponding coefficient on the right-hand side of the second asymptotics will only change its sign.
Thus, the qualitative difference between the estimators
and
is observed only in neighborhoods of the boundary points 0 and 1: for the estimator
, in the
h-neighborhoods of the indicated points, the order of smallness of the bias is
h, and for
this order is
. Such a connection between the estimators (
3) and (
19) seems to be quite natural in view of the relations (
5) and (
20), and the known relationship at the boundary points between Nadaraya–Watson estimators
and locally linear estimators
.
Remark 12. If condition is satisfied, then, for the bias and variance of the estimators and , the following asymptotic representations are well known (see, for example, [1]), which are valid for any under broad conditions on the parameters of the model under consideration:
The above asymptotic representations show that, if the assumptions are valid, then the variance of the Nadaraya–Watson estimator and of the locally linear estimator under broad conditions is asymptotically half the variance of the estimators and , respectively. However, the mean-square error of any estimator is equal to the sum of the variance and the squared bias, which for the compared estimators is asymptotically determined by the quantities or , respectively. In other words, if the standard deviation σ of the errors is not very large and
then the estimator or may be more accurate than . The indicated effect for the estimator is confirmed by the results of computer simulations in [50]. Note also that, in order to choose the optimal (in a certain sense) bandwidth h, the orders of smallness of the bias and the standard deviation of the estimator are usually equated. In other words, if the assumptions are fulfilled, for all four types of estimators considered here, we need to solve the equation . Thus, the optimal bandwidth has the standard order .
Remark 13. Estimators of the form and given in (3) and (19) can be defined a little differently, depending on the choice of one or another partition with highlighted points of the domain of the regression function underlying these estimators. For example, using the Voronoi partition of the segment , an estimator of the form (19) can be given by the equality
where , , for . Looking through the proofs in [50], it is easy to see that in this case all properties of the estimator are preserved, except for the asymptotic representation of the variance. Repeating (with obvious changes) the arguments in the proof of Proposition 1 in [50], we have
Thus, in the case of independent and identically distributed design points, the asymptotic variance of the estimator can be somewhat reduced by choosing one or another partition.
Similarly, in the definition (3) of the estimators , the quantities can be replaced by the Voronoi tiling . It is also worth noting that the indicator factor involved in the definition (3) of the estimator does not affect the asymptotic properties of the estimator given in Theorem 1; we only needed it to calculate the exact asymptotic behavior of the estimator bias.
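As an illustration of this alternative weighting, the cell lengths of the Voronoi partition of [0, 1] generated by the ordered design points can be computed as follows (a sketch under the assumption that boundary cells are cut at 0 and 1; the exact convention used in the paper is given in the displayed formulas above):

```r
# Voronoi cell lengths on [0, 1] for the ordered design points (illustrative).
voronoi_lengths <- function(x) {
  xs  <- sort(x)
  mid <- (xs[-1] + xs[-length(xs)]) / 2   # midpoints between neighbours
  diff(c(0, mid, 1))                      # one cell length per design point
}
```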
5. Simulations
In the following computer simulations, instead of estimator (
3), we used the equivalent estimator
of the weighted least-squares method defined by the relation
where the quantities
are defined in (13) above. Estimator (
27) differs from estimator (
3) by excluding the indicator factor and replacing
with
, which is not essential (see Remark 13). If we had several observations at one design point, then the observations were replaced by one observation presenting their arithmetic mean (see Remark 4 above). Although the notation
in (
27) is somewhat different from the same notation in (
3), we retained the notation
, which will not lead to ambiguity.
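A minimal R sketch of this pre-processing step (see Remark 4) might look as follows; the function name and interface are illustrative:

```r
# Observations sharing a design point are averaged, and only one copy of the
# design point is kept (illustrative implementation of the step above).
collapse_duplicates <- function(x, y) {
  ym <- tapply(y, x, mean)
  list(x = as.numeric(names(ym)), y = as.numeric(ym))
}

# Example: the three observations at x = 0.5 are replaced by their mean.
collapse_duplicates(c(0.2, 0.5, 0.5, 0.5), c(1, 2, 4, 6))
```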
In the simulations below, we will also consider the local constant estimator
from (
26), which can be defined by the equality
Here we also replace the observations corresponding to one design point by their arithmetic mean.
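A sketch of such a local constant fit in R, under the same spacing-weighting assumption as before, is given below (illustrative only):

```r
# Sketch of the universal local constant estimator (28): a locally constant
# fit with spacing-weighted kernel weights (illustrative implementation).
ulc_at_point <- function(t, x, y, h, kern) {
  ord <- order(x)
  xs  <- x[ord]
  ys  <- y[ord]
  dx  <- c(xs[1], diff(xs))              # spacings of the ordered design
  w   <- kern((xs - t) / h) * dx
  if (sum(w) == 0) return(NA_real_)
  sum(w * ys) / sum(w)                   # weighted average of the responses
}
```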
Recall that the Nadaraya–Watson estimator differs from (
28) by the absence of the factors
in the weighting coefficients:
The Nadaraya–Watson estimators are also weighted least-squares estimators:
In the following examples, estimators (
27) and (
28), which will be called
universal local linear (ULL) and
universal local constant (ULC), respectively, will be compared with the estimator of linear regression (LR), the Nadaraya–Watson (NW) estimator, LOESS of order 1, as well as with estimators of generalized additive models (GAM) and of random forest (RF). For LOESS estimators, the R
loess() function was used. Calculating the ULL estimator with the custom script was on average 3.2 times slower than the LOESS estimator calculated by the R
loess() function. That may be explained by the fact that the ULL estimator was implemented in the R language (in contrast to R's
loess(), whose body is implemented in C and Fortran) and was not optimized for performance.
It is worth noting that, in the examples below, the best results were obtained by the new estimators ULL (
27) and ULC (
28), LOESS estimator of order 1, and the Nadaraya–Watson estimator.
With regard to the simulation examples, the main difference between the ULL (
27) and ULC (
28) estimators, and the Nadaraya–Watson and LOESS ones is that ULL (
27) and ULC (
28) are “more local”. This means that if a function
is evaluated on a design interval
A with a “small” number of observations adjacent to a design interval
B with a “large” number of observations, the Nadaraya–Watson and LOESS estimators will primarily seek to adjust to the “large” cluster of observations on the interval
B. At the same time, ULL (
27) and ULC (
28) will equally consider observations on intervals of equal lengths, regardless of the distribution of design points on the intervals.
In the examples below, for all of the kernel estimators that are the Nadaraya–Watson ones, LOESS, ULL (
27), and ULC (
28), we used the tricubic kernel
We chose the tricubic kernel because that kernel is employed in the R function loess(), which was used in the simulations.
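In R, this kernel can be written as follows; the normalizing constant shown here makes it a probability density (as condition on the kernel requires) and is immaterial for the weighted fits:

```r
# Tricube kernel on [-1, 1]; 70/81 normalizes it to a density.
tricube <- function(u) ifelse(abs(u) < 1, (70 / 81) * (1 - abs(u)^3)^3, 0)
```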
The accuracy of the models was estimated with respect to the maximum error and the mean squared error. In all the examples below, except Example 3, the maximum error was estimated on the uniform grid of 1001 points on the segment
by the formula
where
are the grid points of segment
,
,
,
are the values of the constructed estimator at the points of the partition grid, and
are the true values of the estimated function. In Example 3, a grid of 1001 points was taken on the interval from the minimum to the maximum point of the design. That was done in order to avoid assessing the quality of extrapolation since, in that example, the minimum design point could fall far from 0.
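A small sketch of this maximum-error computation, assuming the fitted estimator and the true function are available as vectorized R functions:

```r
# Maximum absolute error on a uniform grid of 1001 points (illustrative).
max_error <- function(fhat, f_true, a = 0, b = 1, n_grid = 1001) {
  grid <- seq(a, b, length.out = n_grid)
  max(abs(fhat(grid) - f_true(grid)))
}
```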
The mean squared error was calculated for one random splitting of the whole sample into training and validation samples in a proportion of
to
, according to the formula
where
m is the validation sample size,
are the validation sample design points,
are the noisy observations of the predicted function in the validation sample,
is the estimate calculated from the training sample. The splittings into training and validation samples were identical for all models.
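A sketch of this hold-out evaluation in R; the argument train_frac is a placeholder for the split proportion stated above, and fit_fun is assumed to return a prediction function:

```r
# Fit on a random training part, compute MSE on the validation part.
holdout_mse <- function(x, y, fit_fun, train_frac) {
  n    <- length(x)
  idx  <- sample.int(n, size = floor(train_frac * n))
  fhat <- fit_fun(x[idx], y[idx])        # fit on the training sample
  mean((y[-idx] - fhat(x[-idx]))^2)      # MSE on the validation sample
}
```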
For each of the kernel estimators, the parameter h of the kernel was determined using cross-validation, minimizing the mean squared error, where the set of observations was partitioned into 10 folds randomly. The same partitions were taken for all the kernel estimators.
When calculating the mean squared error, the cross-validation for choosing
h was carried out on the training set. To calculate the maximum error, the cross-validation was performed on the whole sample. For the Nadaraya–Watson models as well as for ULL (
27) and ULC (
28), the parameter
h was selected from 20 values located on the logarithmic grid from
to 0.9. For LOESS, the parameter
span was chosen in the same way from 20 values located on the logarithmic grid from 0.0001 to 0.9.
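A sketch of this bandwidth selection in R; predict_at is assumed to take (new design points, training design, training responses, h) and return predictions, and the grid endpoints below are placeholders (the lower endpoint used for the Nadaraya–Watson, ULL, and ULC grids is given in the text):

```r
# 10-fold cross-validation over a logarithmic bandwidth grid (illustrative).
cv_bandwidth <- function(x, y, predict_at, h_grid, k = 10) {
  folds <- sample(rep_len(seq_len(k), length(x)))
  err <- sapply(h_grid, function(h) {
    mean(sapply(seq_len(k), function(j) {
      tr   <- folds != j
      pred <- predict_at(x[!tr], x[tr], y[tr], h)
      mean((y[!tr] - pred)^2, na.rm = TRUE)
    }))
  })
  h_grid[which.min(err)]
}

h_grid <- exp(seq(log(0.001), log(0.9), length.out = 20))  # placeholder grid
```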
The simulations also included testing basic statistical learning algorithms: linear regression without regularization, generalized additive model, and random forest [
59]. The training of the generalized additive model was carried out using the R library
mgcv.
Thin-plate splines were used, the optimal form of which was selected using generalized cross-validation. Random forest training was done using the R library randomForest. The number of trees was chosen to be 1000 based on the out-of-bag error plot for a random forest with five observations per leaf. The optimal number of observations per leaf was chosen using 10-fold cross-validation on a logarithmic grid of 20 values from 5 to 2000.
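For orientation, the reference models might be fitted as in the following sketch; the toy data and tuning values here are illustrative, while the settings actually used are those described above:

```r
library(mgcv)           # generalized additive models
library(randomForest)   # random forests

set.seed(3)
dat   <- data.frame(x = runif(500))
dat$y <- sin(2 * pi * dat$x) + rnorm(500, sd = 0.3)   # toy data

gam_fit <- gam(y ~ s(x, bs = "tp"), data = dat)       # thin-plate spline, GCV smoothing
rf_fit  <- randomForest(y ~ x, data = dat, ntree = 1000, nodesize = 5)
```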
In each example, 1000 realizations of different training and validation sets were performed, for each of which the errors were calculated. In each realization of the training and validation sets, 5000 observations were generated. The results of the calculations are presented below in boxplots, where every box represents the median and the 1st and 3rd quartiles. The plots do not show the results of linear regression since, in the examples, its results appeared to be significantly worse than those of the other models. The mean squared and maximum errors of ULL (27) were compared with the errors of the LOESS estimator by the paired Wilcoxon test. The summaries of the errors on the 1000 realizations of different training and validation sets are reported as median (1st quartile, 3rd quartile).
The examples of this section were constructed so that the distribution of design points is “highly nonuniform”. Potentially, this could demonstrate the advantage of the new ULL estimator (
27) over known estimation approaches.
Example 2. Let us set the target function
and let the noise be centered Gaussian with standard deviation (Figure 1). In each realization, we draw 4500 independent design points uniformly distributed on the segment , and 500 independent design points uniformly distributed on the segment . The results are presented in Figure 2. For the maximum error, the advantage of the estimators of order 1 (LOESS and ULL (27)) over the estimators of order 0 (the Nadaraya–Watson and ULC (28)) is noticeable, while ULL (27) turns out to be the best of all considered estimators; in particular, ULL (27) performs better than LOESS: 0.6357 (0.4993, 0.8224) vs. 0.6582 (0.5205, 0.8508), p = 0.019. For the mean squared error, all models, except random forest and linear regression, show similar results. Moreover, ULL (27) turns out to be the best of the considered ones, although the difference between ULL (27) and LOESS is not statistically significant: 4.017 (3.896, 4.139) vs. 4.030 (3.906, 4.154), p = 0.11.
Example 3. The piecewise linear target function is shown in Figure 3. For the sake of simplicity of presentation, we do not present the formula defining this function. Here, the centered Gaussian noise has the standard deviation . The design points are independent and identically distributed with density proportional to the function , . The results are presented in Figure 4. The Nadaraya–Watson estimator appears to be the best model both for the maximum error and for the mean squared error. For both errors, ULL (27) is better than LOESS (p < 0.0001 for the maximum error, p = 0.0030 for the mean squared error).
Example 4. In this example, the design points are strongly dependent. We will define them as follows: , , where A is a positive number such that is irrational (we chose in this example), and are independent random variables uniformly distributed on and independent of the noise. It was shown in [50] that the random sequence is asymptotically everywhere dense on with probability 1. The target function is
shown in Figure 5. The results are presented in Figure 6. For the maximum error, ULL (27) turns out to be the best of all the considered estimators. In particular, ULL (27) is better than LOESS: 1.757 (1.491, 2.053) vs. 2.538 (2.216, 2.886), p < 0.0001. The median mean squared error for ULL (27) also turns out to be the smallest of those considered. In that sense, ULL (27) is better than LOESS, but the difference is not significant: 4.166 (4.025, 4.751) vs. 4.219 (4.096, 4.338), p = 0.92.
Example 5. In this example, the target function was the same as in Example 4. The difference from the previous example is that 50,000 design points were generated by the same technique, and then 5000 of these 50,000 points were selected. This allowed us to fill the domain of f with design elements “more uniformly” than in the previous example, while preserving the clusters of design points.
The results are presented in Figure 7. For the maximum error, ULL (27) turns out to be the best of all the considered estimators. In particular, ULL (27) is better than LOESS: 2.872 (2.369, 3.488) vs. 9.435 (5.719, 10.9), . For the mean squared error, the best estimator is LOESS. ULL (27) is worse than LOESS: 5.108 (4.535, 6.597) vs. 4.378 (4.229, 4.541), , but it is better than the other estimators considered.
6. Real Data Application
In this section, we consider an application of the models considered in the previous section to the data collected in the multicenter study “Epidemiology of cardiovascular diseases in the regions of the Russian Federation”. In that study, representative samples of unorganized male and female populations aged 25–64 years from 13 regions of the Russian Federation were studied. The study was approved by the Ethics Committees of the three federal centers: State Research Center for Preventive Medicine, Russian Cardiology Research and Production Complex, Almazov Federal Medical Research Center. Each participant provided written informed consent for the study. The study was described in detail in [
60].
One of the urgent problems of modern medicine is to study the relationship between heart rate (HR) and systolic arterial blood pressure (SBP), especially for low observation values. Therefore, we chose SBP as the outcome and HR as the predictor. The association between these variables was previously estimated to be nonlinear [
61]. The general analysis included 6597 participants from four regions of the Russian Federation. The levels of SBP and HR were statistically significantly pairwise different between the selected regions. Thus, the hypothesis of the independence of design points was violated.
In this section, the maximum error cannot be calculated because the exact form of the relationship is unknown, so only the mean squared error is reported. The mean squared error was calculated for 1000 random partitions of the entire set of observations into training () and validation () samples.
The results are presented in
Figure 8. Here, the GAM estimator and the kernel estimators showed similar results, which were better than the results of both the linear regression and random forest.
The best estimator turned out to be ULC (
28), although its difference from the Nadaraya–Watson estimator was not statistically significant: 220.2 (215.4, 225.9) vs. 220.4 (215.4, 225.8),
. The difference between ULL (
27) and LOESS was not significant either: 220.4 (215.4, 225.9) vs. 220.6 (215.6, 226.1),
.
7. Conclusions
In this paper, for a wide class of nonparametric regression models with a random design, universal uniformly consistent kernel estimators are proposed for an unknown random regression function of a scalar argument. These estimators belong to the class of local linear estimators. However, in contrast to the vast majority of previously known results, traditional conditions of dependence of design elements are not needed for the consistency of the new estimators. The design can be either fixed and not necessarily regular, or random and not necessarily consisting of independent or weakly dependent random variables. With regard to design elements, the only condition that is required is the dense filling of the regression function domain with the design points.
Explicit upper bounds are found for the rate of uniform convergence in probability of the new estimators to an unknown random regression function. The only characteristic explicitly included in these estimators is the maximum spacing statistic of the variational series of design elements, which requires only the convergence to zero in probability of the maximum spacing as the sample size tends to infinity. The advantage of this condition over the classical ones is that it is insensitive to the forms of dependence of the design observations. Note that this condition is, in fact, necessary, since only when the design densely fills the regression function domain is it possible to reconstruct the regression function with some accuracy. As a corollary of the main result, we obtain consistent estimators for the mean function of continuous random processes.
In the simulation examples of
Section 5, the new estimators were compared with known kernel estimators. In some of the examples, the new estimators proved to be the most accurate. In the application to real medical data considered in
Section 6, the accuracy of new estimators was also comparable with that of the best-known kernel estimators.