Article

Exact Boundary Correction Methods for Multivariate Kernel Density Estimation

Ji-Yeon Yang

Department of Mathematics and Big Data Science, Kumoh National Institute of Technology, 61 Daehak-Ro, Gumi 39177, Republic of Korea
Symmetry 2023, 15(9), 1670; https://doi.org/10.3390/sym15091670
Submission received: 30 July 2023 / Revised: 24 August 2023 / Accepted: 28 August 2023 / Published: 30 August 2023
(This article belongs to the Special Issue Nonparametric Statistics and Biostatistical Methods)

Abstract
This paper develops a method to obtain multivariate kernel functions for density estimation problems in which the density function is defined on compact support. If domain-specific knowledge requires certain conditions to be satisfied at the boundary of the support of an unknown density, the proposed method incorporates the information contained in the boundary conditions into the kernel density estimators. The proposed method provides an exact kernel function that satisfies the boundary conditions, even for small samples. Existing methods primarily address a one-sided boundary in a one-dimensional setting; we consider densities on a two-sided interval and extend the method to multi-dimensional problems.

1. Introduction

Many variables of interest have their natural boundaries. The propensity score is the probability of an individual receiving treatment and ranges between zero and one. The variant allele frequency (VAF) is an essential measure of genetic variation, ranging from zero to one. The visual analog scale (VAS) score for pain ranges from 0 to 100 mm. When a researcher is interested in the shape of the density function of a bounded variable, conventional kernel methods, commonly used in statistical analysis, may not accurately capture the true shape or slope of the unknown density function, particularly near the boundary.
In semiparametric statistical problems, a nonparametrically estimated density or function can be used as an input for the second-stage estimation or testing procedure of the primary parameter of interest (see [1,2]). Many estimators that achieve semiparametric efficiency bounds have been obtained using this approach. When a nonparametric estimate serves as an intermediate input of statistical inference, the natural boundary of the variable is crucial because ignoring it can introduce significant boundary biases.
Several methods have been proposed for solving the density estimation problem within a bounded interval. These methods include reflection [3], pseudo-data generation [4], boundary kernel [5,6], and data transformation [7,8]. Among these, only the reflection method provides an exact kernel function when applied to a one-sided bounded domain problem. The other methods offer asymptotic solutions; the resulting kernel density estimators satisfy the boundary conditions only asymptotically.
When the support of the variable of interest is truly bounded (e.g., a two-sided interval), the reflection method is no longer exact. In the literature, it is often suggested that methods designed for one-sided domains can be extended to domains with two-sided boundaries by applying the method separately to each boundary. However, this suggestion is only justifiable in an asymptotic sense and does not yield an exact kernel function. This paper presents a systematic method for constructing exact kernel functions on a compact support with boundary conditions. The proposed method accommodates the most plausible boundary conditions, and several examples designed for different boundary conditions are discussed.
Multi-dimensional extensions are essential in function and density estimation problems. All the methods in the literature have focused on the one-dimensional case, the function or density estimation problem on a real line. This paper presents a multivariate kernel density estimator designed for a compact support with boundary conditions. This is a new and significant contribution to the literature.
This study builds upon the method developed by [9,10], who proposed a kernel density estimator as a solution to the heat (or diffusion) equation. Ref. [10] examined the relationship between kernel- and penalty-based methods for density estimation problems from this perspective. Ref. [9] derived the kernel function that satisfies a boundary condition at the endpoints. In comparison to the aforementioned studies, this study extends the method’s scope to multivariate density estimation problems. Both [9,10] focused solely on one-dimensional density estimation problems. Furthermore, in contrast to [10], we consider the problem of finite support with boundary conditions. In contrast to [9], we allow for general boundary conditions beyond the zero-derivative conditions at the boundary.
The remainder of this paper is organized as follows. Section 2 and Section 3 describe the procedure for deriving the exact kernel function on a compact support. In Section 4, we provide examples of kernel functions designed for different boundary conditions. Section 5 compares the proposed method with existing approaches, particularly the reflection method. Section 6 presents a multivariate extension. Section 7 assesses the finite sample performance of the estimators by simulation. In Section 8, we present a real data example, and Section 9 concludes the paper. The proofs of the main results can be found in Appendix A.

2. Overview: Kernel Density Estimation and the Heat Equation

Kernel density estimation is a fundamental tool in nonparametric and semiparametric statistical methods. Let $X_1, \dots, X_n$ be observations drawn from an unknown distribution $F$ with density $f$. The following Gaussian kernel density estimator is a widely used estimator of $f$:
\[ \hat{f}(x) = \frac{1}{\sqrt{2\pi}\, n h} \sum_{i=1}^{n} \exp\left( -\frac{(x - X_i)^2}{2 h^2} \right). \qquad (1) \]
This kernel density estimator can be regarded as a solution to the heat (or diffusion) equation. This heat (or diffusion) equation interpretation of kernel smoothing was used in [9,10] and is well-established in the imaging literature. Recent studies that employ this relationship include [11,12]. In this study, we apply the same perspective to density estimation problems with nontrivial boundary conditions. By modifying the boundary conditions of the heat equation, we derive various forms of exact kernel functions for the density estimation on a compact support. Subsequently, we extend this approach to address multivariate density estimation problems.
We now establish the connection between the kernel smoothing method and the heat equation, focusing on a one-dimensional problem. The empirical distribution function is obtained by assigning equal mass $1/n$ to each data point $X_i$:
\[ F_n(x) = \frac{1}{n} \sum_{i=1}^{n} I(X_i \le x). \]
Its derivative, known as the empirical density, is a linear combination of $n$ point masses, each positioned at one of the data points:
\[ dF_n(x) = \frac{1}{n} \sum_{i=1}^{n} \delta(x - X_i), \]
where $\delta(\cdot)$ is the Dirac delta function. Let a kernel function $K(\cdot)$ be a well-defined probability density function, and define $K_h(\cdot) = h^{-1} K(\cdot / h)$. The bandwidth $h$ serves as a scale parameter for the probability density $K(\cdot)$. The kernel density estimator is defined as the convolution of the kernel and the empirical density function:
\[ \hat{f}(x) = \int K_h(x - u)\, dF_n(u) = \frac{1}{nh} \sum_{i=1}^{n} K\left( \frac{x - X_i}{h} \right). \]
Consider a rod with an initial temperature described by the function $g(x)$. Let $\phi(x,t)$ denote the temperature at a point $x$ and time $t$. If the rod is sufficiently long, the influence of the boundary on the temperature distribution can be neglected; therefore, we can model the phenomenon over the entire line $-\infty < x < \infty$. Let $\phi_x(x,t)$ and $\phi_{xx}(x,t)$ be the first and second derivatives with respect to the first argument, and $\phi_t(x,t)$ be the first derivative with respect to the second argument. Consequently, we determine the temperature $\phi(x,t)$ using the following equations:
\[ \phi_t(x,t) = \tfrac{1}{2} \phi_{xx}(x,t), \qquad -\infty < x < \infty, \; t > 0 \qquad (3) \]
\[ \phi(x,0) = g(x), \qquad -\infty < x < \infty \qquad (4) \]
\[ \phi(x,t) \to 0 \quad \text{as} \quad x \to \pm\infty. \qquad (5) \]
The heat or diffusion Equation (3) describes the propagation of heat over time $t$ and location $x$. Equation (4) specifies the initial temperature at time $t = 0$, while Equation (5) represents the boundary condition. Proposition 1 demonstrates that the solution to the heat equation is
\[ \phi(x,t) = \frac{1}{\sqrt{2\pi t}} \int_{-\infty}^{\infty} g(u) \exp\left( -\frac{(x-u)^2}{2t} \right) du. \qquad (6) \]
Suppose the data provide the initial condition $g(\cdot)$; that is, the empirical density function $dF_n(\cdot)$ serves as the initial condition. Then, the solution takes the following form:
\[ \phi(x,t) = \frac{1}{\sqrt{2\pi t}} \int_{-\infty}^{\infty} \frac{1}{n} \sum_{i=1}^{n} \delta(u - X_i) \exp\left( -\frac{(x-u)^2}{2t} \right) du = \frac{1}{\sqrt{2\pi t}\, n} \sum_{i=1}^{n} \exp\left( -\frac{(x - X_i)^2}{2t} \right). \qquad (7) \]
By setting $t = h^2$, we can identify the solution (7) to the heat equation as the kernel density estimate in Equation (1). The role of time $t$ (more precisely, $\sqrt{t}$) as a smoothing parameter $h$ is intuitive when considering the heat equation. The heat or diffusion equation describes how heat or a liquid initially accumulated at a single point diffuses over time. Figure 1 illustrates this phenomenon. In the initial stage at $t = 0$, all the heat is concentrated at a single point. As time passes, $t = 0.4, 0.6, 0.8, \dots$, heat spreads to the neighborhood of the initial point and is eventually felt even in remote areas.
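The identity $t = h^2$ is easy to check numerically. The following is a minimal sketch (assuming NumPy; the sample and bandwidth are illustrative) that evaluates the Gaussian kernel density estimator (1) and the heat-equation solution (7) on a grid and confirms they coincide:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=100)           # sample X_1, ..., X_n
h = 0.3                            # bandwidth
t = h**2                           # diffusion time playing the role of h^2

def gaussian_kde(x, X, h):
    """Equation (1): average of Gaussian bumps of scale h at the data."""
    return np.exp(-(x[:, None] - X[None, :])**2 / (2 * h**2)).sum(axis=1) \
           / (np.sqrt(2 * np.pi) * len(X) * h)

def heat_solution(x, X, t):
    """Equation (7): heat kernel at time t convolved with the empirical density."""
    return np.exp(-(x[:, None] - X[None, :])**2 / (2 * t)).sum(axis=1) \
           / (np.sqrt(2 * np.pi * t) * len(X))

x = np.linspace(-3, 3, 201)
assert np.allclose(gaussian_kde(x, X, h), heat_solution(x, X, t))
```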
Proposition 1 illustrates the solution to the heat equation. The proof can be found in standard textbooks such as [13,14,15,16]. We include it in an appendix because it illustrates the fundamental procedure that is repeatedly employed throughout this study.
Proposition 1. 
The solution to the heat Equation (3), subject to the initial condition (4) and boundary condition (5), is
\[ \phi(x,t) = \frac{1}{\sqrt{2\pi t}} \int_{-\infty}^{\infty} g(u) \exp\left( -\frac{(x-u)^2}{2t} \right) du. \]
The solution method consists of two steps. In the first step, we utilize the heat Equation (3) and boundary condition (5) to derive the heat kernel function. In the above example, the heat kernel is Gaussian, $H(x,u,t) = \frac{1}{\sqrt{2\pi t}} \exp\left( -\frac{(x-u)^2}{2t} \right)$. The kernel function depends solely on $x - u$; therefore, it can be written as $H(x-u, t)$. In the second step, we use the observed data as an initial condition and perform the convolution of the heat kernel with the empirical density, resulting in $\int H(x-u,t)\, dF_n(u)$.
In the subsequent sections, this procedure will be employed to derive the precise boundary kernel for density estimation on a compact support with general boundary conditions.

3. Method: Kernel Density Estimation on a Finite Interval with Boundary Conditions

We assume prior knowledge or make assumptions regarding the boundary behavior of an unknown density function. Some variables naturally possess conditions at their boundaries. Occasionally, we can estimate the boundary value or slope of the density function during the first stage. Fixing the value of the density function at a boundary is referred to as the Dirichlet condition, while predetermining the slope of the density function at a boundary is known as the Neumann condition. More general types of boundary conditions will be introduced in the following section.
Suppose the density function is supported on an interval $[0, l]$. We focus on cases where the support of the density is known. However, if, for instance, the right end point $l$ is unknown, it would be reasonable to estimate it using the maximum value of the sample. Assume that the slope of the density at the boundary is 0. We can formulate the problem as a heat equation with the Neumann boundary condition:
\[ \phi_t(x,t) = \tfrac{1}{2} \phi_{xx}(x,t), \qquad 0 < x < l, \; t > 0 \qquad (8) \]
\[ \phi(x,0) = g(x), \qquad 0 < x < l \qquad (9) \]
\[ \phi_x(0,t) = 0, \quad \phi_x(l,t) = 0, \qquad t > 0. \qquad (10) \]
This type of boundary condition is known as a perfect insulation condition. The solutions to these equations provide the exact kernel, as summarized in Proposition 2.
Proposition 2. 
The solution of the heat Equation (8), subject to the perfect insulation condition (10) at both boundaries, is given by
\[ \phi(x,t) = \int_0^l H(x,u,t)\, g(u)\, du, \]
where the corresponding kernel is
\[ H(x,u,t) = \frac{1}{l} + \frac{2}{l} \sum_{m=1}^{\infty} \exp\left( -\frac{m^2 \pi^2 t}{2 l^2} \right) \cos\frac{m\pi u}{l} \cos\frac{m\pi x}{l}. \qquad (11) \]
A kernel density estimate is obtained with the initial condition $g(u) = \frac{1}{n} \sum_{i=1}^{n} \delta(u - X_i)$:
\[ \hat{f}(x,t) = \frac{1}{l} + \frac{2}{nl} \sum_{i=1}^{n} \sum_{m=1}^{\infty} \exp\left( -\frac{m^2 \pi^2 t}{2 l^2} \right) \cos\frac{m\pi X_i}{l} \cos\frac{m\pi x}{l}. \qquad (12) \]
The solution (12) is akin to orthogonal series estimators, where a trigonometric series serves as the basis for the expansion. The primary distinction lies in the rapid decay of the factor $\exp(-m^2\pi^2 t/(2l^2))$ as $m$ increases. Due to this dampening factor, the influence of additional terms in the series (12) diminishes quickly. Unlike orthogonal series estimators, Equation (12) does not “collapse” to the data itself, preventing “Dirac’s disaster”; this implies that smoothing or truncating the series is not crucial (see [3], pp. 23–25). Because the contributions of the additional terms are negligible in our solution (12), the first few terms in the series determine most of the shape of the density, and the series converges uniformly over the entire support. However, when numerically evaluating the density estimate in Proposition 2, it becomes necessary to determine the number of terms to add. In most cases, using the first 50 terms suffices to obtain a satisfactory approximation of the solution.
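For concreteness, here is a minimal sketch (assuming NumPy; the truncation level M = 50 follows the remark above, and the data are illustrative) of the estimator in Equation (12):

```python
import numpy as np

def kde_neumann(x, X, t, l=1.0, M=50):
    """Exact-kernel estimate (12) on [0, l] with zero-slope (Neumann) boundaries,
    truncated at M series terms."""
    m = np.arange(1, M + 1)
    damp = np.exp(-(m * np.pi)**2 * t / (2 * l**2))      # dampening factor
    S = np.cos(np.pi * np.outer(X, m) / l).sum(axis=0)   # sum_i cos(m*pi*X_i/l)
    cos_x = np.cos(np.pi * np.outer(x, m) / l)           # series terms at x
    return 1.0 / l + (2.0 / (len(X) * l)) * cos_x @ (damp * S)

X = np.random.default_rng(1).uniform(size=200)           # illustrative sample on [0, 1]
grid = np.linspace(0.0, 1.0, 101)
f_hat = kde_neumann(grid, X, t=0.1**2)                   # t = h^2 with h = 0.1
```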
Next, we consider the density estimation under the Dirichlet condition. Suppose that the value of the density function is known at the boundary, and the density function is 0 at both endpoints. This particular problem can be expressed as the heat Equation (8), with the initial condition (9) and boundary condition
\[ \phi(0,t) = 0, \quad \phi(l,t) = 0, \qquad t > 0. \qquad (13) \]
Proposition 3. 
The solutions of Equations (8), (9), and (13) are given by
\[ \phi(x,t) = \int_0^l H(x,u,t)\, g(u)\, du \]
with kernel function
\[ H(x,u,t) = \frac{2}{l} \sum_{m=1}^{\infty} \exp\left( -\frac{m^2 \pi^2 t}{2 l^2} \right) \sin\frac{m\pi u}{l} \sin\frac{m\pi x}{l}. \qquad (14) \]
The resulting kernel density estimate is
\[ \hat{f}(x,t) = \int_0^l H(x,u,t)\, dF_n(u) = \frac{2}{nl} \sum_{i=1}^{n} \sum_{m=1}^{\infty} \exp\left( -\frac{m^2 \pi^2 t}{2 l^2} \right) \sin\frac{m\pi X_i}{l} \sin\frac{m\pi x}{l}. \]
Consider a scenario where the information generated by the data is perfectly absorbed upon reaching a boundary. This situation is analogous to having large reservoirs of zero temperature at both ends of the domain. When heat flow reaches the boundary from the interior, it is rapidly absorbed (cooled), maintaining the boundary temperature at 0. Because, within a finite interval, the amount of information the data supply is always positive, we will lose a certain amount of information through the boundaries. Therefore, under the Dirichlet condition, the total amount of information no longer sums to one, and the resulting kernel density estimate is no longer a proper density function. This deficiency can be easily addressed: the heat (or information) lost can be inferred by calculating the heat (or information) conserved within the domain. At a fixed $t$, $\phi(x,t)$ represents the amount of information at location $x$, and the total information remaining in the domain is
\[ \int_0^l \phi(x,t)\, dx = \int_0^l \hat{f}(x,t)\, dx. \]
Then the proper density function is
\[ \check{f}(x,t) = \left[ \int_0^l \frac{2}{nl} \sum_{i=1}^{n} \sum_{m=1}^{\infty} \exp\left( -\frac{m^2\pi^2 t}{2l^2} \right) \sin\frac{m\pi X_i}{l} \sin\frac{m\pi x}{l}\, dx \right]^{-1} \times \frac{2}{nl} \sum_{i=1}^{n} \sum_{m=1}^{\infty} \exp\left( -\frac{m^2\pi^2 t}{2l^2} \right) \sin\frac{m\pi X_i}{l} \sin\frac{m\pi x}{l}. \]
Whether the estimate under the zero boundary condition needs to be renormalized into a proper density should be determined based on the specific problem.
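A minimal sketch of this renormalization (assuming NumPy; the Beta(2,2) sample is illustrative, and the integral is computed numerically rather than in closed form):

```python
import numpy as np

def kde_dirichlet(x, X, t, l=1.0, M=50):
    """Exact-kernel estimate on [0, l] with zero (Dirichlet) boundary values (14)."""
    m = np.arange(1, M + 1)
    damp = np.exp(-(m * np.pi)**2 * t / (2 * l**2))
    S = np.sin(np.pi * np.outer(X, m) / l).sum(axis=0)   # sum_i sin(m*pi*X_i/l)
    sin_x = np.sin(np.pi * np.outer(x, m) / l)
    return (2.0 / (len(X) * l)) * sin_x @ (damp * S)

X = np.random.default_rng(2).beta(2, 2, size=200)        # density vanishes at 0 and 1
grid = np.linspace(0.0, 1.0, 1001)
f_raw = kde_dirichlet(grid, X, t=0.1**2)                 # loses mass through the boundary
f_proper = f_raw / np.trapz(f_raw, grid)                 # rescale so it integrates to one
```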

4. Method: General Boundary Conditions in One-Dimensional Problems

In this section, we discuss more general problems. Suppose that, at the endpoints, the values of the density function are given by $f(0) = \alpha$, $f(l) = \beta$ for $\alpha > 0$, $\beta > 0$. This is a heat equation problem with the Dirichlet condition and is described by the diffusion Equation (8), initial condition (9), and boundary conditions
\[ \phi(0,t) = \alpha, \quad \phi(l,t) = \beta, \qquad t > 0. \qquad (16) \]
Proposition 4 identifies the solution to this problem.
Proposition 4. 
The solutions to Equations (8), (9), and (16) are
\[ \phi(x,t) = \alpha + (\beta - \alpha)\frac{x}{l} + \frac{2}{l} \sum_{m=1}^{\infty} \exp\left( -\frac{m^2\pi^2 t}{2l^2} \right) \left[ \frac{1}{n} \sum_{i=1}^{n} \sin\frac{m\pi X_i}{l} - \int_0^l \left( \alpha + \frac{\beta - \alpha}{l} u \right) \sin\frac{m\pi u}{l}\, du \right] \sin\frac{m\pi x}{l} \]
\[ \phantom{\phi(x,t)} = \alpha + (\beta - \alpha)\frac{x}{l} + \frac{2}{l} \sum_{m=1}^{\infty} \exp\left( -\frac{m^2\pi^2 t}{2l^2} \right) \left[ \frac{1}{n} \sum_{i=1}^{n} \sin\frac{m\pi X_i}{l} - \frac{l}{m\pi} \left( \alpha - \beta(-1)^m \right) \right] \sin\frac{m\pi x}{l}. \]
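A minimal sketch of this estimator (assuming NumPy; the boundary values alpha, beta and the sample are illustrative) that uses the closed-form integral in the second line of Proposition 4:

```python
import numpy as np

def kde_dirichlet_ab(x, X, t, alpha, beta, l=1.0, M=50):
    """Proposition 4's estimate on [0, l] with boundary values f(0)=alpha, f(l)=beta."""
    m = np.arange(1, M + 1)
    damp = np.exp(-(m * np.pi)**2 * t / (2 * l**2))
    data_part = np.sin(np.pi * np.outer(X, m) / l).mean(axis=0)   # (1/n) sum_i sin(m*pi*X_i/l)
    steady_part = (l / (m * np.pi)) * (alpha - beta * (-1.0)**m)  # closed-form integral term
    coef = damp * (data_part - steady_part)
    return alpha + (beta - alpha) * x / l + (2.0 / l) * np.sin(np.pi * np.outer(x, m) / l) @ coef

X = np.random.default_rng(0).uniform(size=200)            # illustrative sample on [0, 1]
grid = np.linspace(0.0, 1.0, 101)
f_hat = kde_dirichlet_ab(grid, X, t=0.1**2, alpha=1.0, beta=1.0)
```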
As a second example, we consider a mixed boundary condition: the first derivative of the density is given on one side of the support, while the density value is known on the other. We can apply the Neumann condition at the left end and the Dirichlet condition at the right end. This problem can be represented as a heat equation with the following boundary conditions:
\[ \phi_x(0,t) = 0, \quad \phi(l,t) = 0, \qquad t > 0. \qquad (17) \]
The solution is presented in Proposition 5.
Proposition 5. 
The solutions to Equations (8), (9), and (17) are
\[ \hat{f}(x,t) = \frac{2}{nl} \sum_{i=1}^{n} \sum_{m=1}^{\infty} \exp\left( -\frac{(m - 1/2)^2 \pi^2 t}{2l^2} \right) \cos\frac{(m - 1/2)\pi X_i}{l} \cos\frac{(m - 1/2)\pi x}{l}. \]
Next, suppose a certain type of symmetry exists at the boundary, with the same density values and derivatives at the two ends. To highlight the symmetry, consider the support $[-l, l]$; then, $f(-l) = f(l)$ and $f'(-l) = f'(l)$. This can be represented as a diffusion equation with the following boundary conditions:
\[ \phi(-l,t) = \phi(l,t), \qquad \phi_x(-l,t) = \phi_x(l,t), \qquad t > 0. \qquad (18) \]
Proposition 6 identifies the solution for symmetric boundary conditions.
Proposition 6. 
The solutions to Equations (8), (9), and (18) are
\[ \hat{f}(x,t) = \frac{1}{2l} + \frac{1}{nl} \sum_{i=1}^{n} \sum_{m=1}^{\infty} \exp\left( -\frac{m^2\pi^2 t}{2l^2} \right) \left[ \cos\frac{m\pi X_i}{l} \cos\frac{m\pi x}{l} + \sin\frac{m\pi X_i}{l} \sin\frac{m\pi x}{l} \right]. \]
It is impossible to cover all possible kernels for every conceivable boundary condition. However, a general principle is at work: one can recast a density estimation problem as a heat/diffusion equation problem in order to determine an appropriate kernel with a specific boundary behavior. If the solution of that particular case is known, either analytically or numerically, in the context of the heat equation, it can be applied to density estimation problems.

5. Discussion: Comparisons with Existing Boundary Kernel Estimators

Several well-known methods exist for density estimation within a bounded interval. This section highlights similarities and differences between the newly proposed exact kernel estimator and the existing methods. The reflection method [3] provides exact kernel functions when the density is defined on a half interval with specific boundary conditions. The pseudo-data generation method [4] is a similar variant. This method differs from the reflection method in the way it generates reflected data. These two methods are the closest to the approach used in the present study. Additional methods include the boundary kernel [5,6] and transformation methods [7,8].
Except for the reflection method, existing methods do not provide an exact kernel function and are only valid asymptotically. Most of these methods are developed for a domain with a one-sided boundary, such as a positive half interval. While these methods can sometimes be extended to a finite interval with two boundaries, it is not always possible; moreover, extending them to multi-dimensional problems can be challenging.
Unlike existing methods, the proposed approach is applicable to a finite, two-sided interval domain. Extending this approach to multivariate problems is straightforward, and this multivariate kernel density estimation on a bounded domain constitutes the main contribution of this paper. Furthermore, with almost any boundary condition, it is possible to find an exact kernel that provides a valid density estimate.
Because the reflection and ‘negative’ reflection methods [3] are the most basic strategies in the literature for problems within a bounded domain and can be considered special cases of the diffusion equation-based method, we focus on the reflection method in this section. The origins of the reflection and negative reflection [3] can be traced back to [17], where formulas for a half-line density estimation problem were proposed. Although some properties of these formulas were stated there, their source and contextual interpretation were not made explicit.
First, we describe the (negative) reflection method. Suppose that a sample $(X_1, \dots, X_n)$ is drawn from an unknown density on the half-line $[0, \infty)$. The reflection method augments the data by copying the original observations onto the opposite side of the domain, yielding $(X_1, -X_1, \dots, X_n, -X_n)$. Applying the normal kernel density method to the augmented data yields
\[ \hat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} \left[ K\left( \frac{x - X_i}{h} \right) + K\left( \frac{x + X_i}{h} \right) \right]. \]
If the kernel function is symmetric around 0 and differentiable, the first derivative of the density estimate is always 0 at the boundary, $\hat{f}'(0) = 0$. This is because
\[ \hat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} \left[ K\left( \frac{x - X_i}{h} \right) + K\left( \frac{x + X_i}{h} \right) \right]; \quad \text{therefore,} \quad \hat{f}'(x) = \frac{1}{nh^2} \sum_{i=1}^{n} \left[ K'\left( \frac{x - X_i}{h} \right) + K'\left( \frac{x + X_i}{h} \right) \right], \]
and finally, since $K'$ is an odd function,
\[ \hat{f}'(0) = \frac{1}{nh^2} \sum_{i=1}^{n} \left[ K'\left( \frac{-X_i}{h} \right) + K'\left( \frac{X_i}{h} \right) \right] = 0. \]
The negative reflection method is defined in a similar manner. Augment the data and apply the usual kernel method, but with weights of $-1$ on the reflected data:
\[ \hat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} \left[ K\left( \frac{x - X_i}{h} \right) - K\left( \frac{x + X_i}{h} \right) \right]. \]
We can show that the boundary value of the density estimate is always 0, $\hat{f}(0) = 0$, because by the symmetry of $K$,
\[ \hat{f}(0) = \frac{1}{nh} \sum_{i=1}^{n} \left[ K\left( \frac{-X_i}{h} \right) - K\left( \frac{X_i}{h} \right) \right] = \frac{1}{nh} \sum_{i=1}^{n} \left[ K\left( \frac{X_i}{h} \right) - K\left( \frac{X_i}{h} \right) \right] = 0. \]
With the Gaussian kernel, the density estimates from the reflection and negative reflection methods are
\[ \hat{f}(x) = \frac{1}{\sqrt{2\pi t}\, n} \sum_{i=1}^{n} \left[ \exp\left( -\frac{(x - X_i)^2}{2t} \right) + \exp\left( -\frac{(x + X_i)^2}{2t} \right) \right], \qquad (19) \]
\[ \hat{f}(x) = \frac{1}{\sqrt{2\pi t}\, n} \sum_{i=1}^{n} \left[ \exp\left( -\frac{(x - X_i)^2}{2t} \right) - \exp\left( -\frac{(x + X_i)^2}{2t} \right) \right]. \qquad (20) \]
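A minimal sketch of both estimators (assuming NumPy; the half-line sample is illustrative):

```python
import numpy as np

def kde_reflect(x, X, h, sign=+1):
    """Reflection (sign=+1, Equation (19)) or negative reflection (sign=-1,
    Equation (20)) with a Gaussian kernel on the half-line [0, inf)."""
    z_minus = (x[:, None] - X[None, :]) / h
    z_plus = (x[:, None] + X[None, :]) / h
    bumps = np.exp(-z_minus**2 / 2) + sign * np.exp(-z_plus**2 / 2)
    return bumps.sum(axis=1) / (np.sqrt(2 * np.pi) * len(X) * h)

X = np.abs(np.random.default_rng(3).normal(size=300))    # illustrative half-line data
grid = np.linspace(0.0, 3.0, 301)
f_r = kde_reflect(grid, X, h=0.2, sign=+1)               # slope 0 at the origin
f_nr = kde_reflect(grid, X, h=0.2, sign=-1)              # value 0 at the origin
```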
Equations (19) and (20) provide the exact kernel functions for density estimation on a half-line with the corresponding boundary conditions. However, this method is best understood when posed as a heat equation problem. The framework of the heat equation elucidates why (19) or (20) should be proposed as solutions to this problem: the reflection methods solve the heat equation defined on a half-line with the corresponding boundary conditions. Consider the heat equation on the positive half-line. Our exposition on this subject is drawn from [15].
\[ \phi_t(x,t) = \tfrac{1}{2} \phi_{xx}(x,t), \qquad 0 < x < \infty, \; t > 0 \]
\[ \phi(x,0) = g(x), \qquad 0 < x < \infty \]
\[ \phi_x(0,t) = 0, \qquad t > 0. \]
One method to solve this half-line problem is to use an even extension. Because the initial condition $g(x)$ is defined solely on the half-line, we define its even extension as
\[ g_{even}(x) = \begin{cases} g(x) & \text{for } x \ge 0 \\ g(-x) & \text{for } x < 0 \end{cases} \]
and consider the entire-line problem with the new initial condition $g_{even}(x)$:
\[ \varphi_t(x,t) = \tfrac{1}{2} \varphi_{xx}(x,t), \qquad -\infty < x < \infty, \; t > 0 \]
\[ \varphi(x,0) = g_{even}(x), \qquad -\infty < x < \infty. \]
Its solution, which addresses the entire-line problem, is presented in Proposition 1. Let $H(x,u,t) = H(x-u,t)$ denote the Gaussian kernel $\frac{1}{\sqrt{2\pi t}} \exp\left( -\frac{(x-u)^2}{2t} \right)$; then
\[ \varphi(x,t) = \int_{-\infty}^{\infty} H(x-u,t)\, g_{even}(u)\, du. \]
Restricted to the positive half-line,
\[ \phi(x,t) = \varphi(x,t) \quad \text{for } x > 0 \]
is the unique solution to the original half-line problem. The boundary condition $\phi_x(0,t) = 0$ is inherently satisfied because $\varphi(x,t)$ is an even function of $x$. Its derivative with respect to the first argument, $\varphi_x(x,t)$, is an odd function; therefore, at the origin, $\varphi_x(0,t) = \phi_x(0,t) = 0$.
The explicit form of $\phi(x,t)$ follows from $\varphi(x,t)$:
\[ \varphi(x,t) = \int_{-\infty}^{\infty} H(x-u,t)\, g_{even}(u)\, du = \int_0^{\infty} H(x-u,t)\, g(u)\, du + \int_{-\infty}^{0} H(x-u,t)\, g(-u)\, du = \int_0^{\infty} H(x-u,t)\, g(u)\, du + \int_0^{\infty} H(x+u,t)\, g(u)\, du = \int_0^{\infty} \left[ H(x-u,t) + H(x+u,t) \right] g(u)\, du. \]
The second equality arises from the definition of $g_{even}(x)$, and the third equality is the change of variable $u \mapsto -u$ in the second integral. The solution $\phi(x,t)$ is derived by restricting $\varphi(x,t)$ to $x > 0$. Finally, writing out the Gaussian kernel and setting $g(x) = dF_n(x)$ yield the following explicit formula:
\[ \phi(x,t) = \frac{1}{\sqrt{2\pi t}} \int_0^{\infty} \left[ \exp\left( -\frac{(x-u)^2}{2t} \right) + \exp\left( -\frac{(x+u)^2}{2t} \right) \right] dF_n(u) = \frac{1}{\sqrt{2\pi t}\, n} \sum_{i=1}^{n} \left[ \exp\left( -\frac{(x - X_i)^2}{2t} \right) + \exp\left( -\frac{(x + X_i)^2}{2t} \right) \right]. \]
This is exactly the ‘reflection method’ (19) when we set $t = h^2$. The even extension of the initial condition corresponds to augmenting the data by reflecting the original observations onto the negative half-line.
Just as the even extension explains how the reflection method is obtained, the odd extension provides a rationale for the negative reflection method. We define the odd extension as
\[ g_{odd}(x) = \begin{cases} g(x) & \text{for } x \ge 0 \\ -g(-x) & \text{for } x < 0. \end{cases} \]
Following the same logic, we arrive at the negative reflection solution (20); one can verify that the odd extension ensures the boundary condition $\phi(0,t) = 0$.
The above discussion explains why these methods solve the kernel density problem on a one-sided domain. However, it also underscores the limitations of the simple reflection method. Suppose the data are generated from a two-sided finite interval, and the derivative of the density function is 0 on both sides. One might be tempted to repeat the reflection procedure twice, once for the left side and once for the right side. However, the kernel obtained in this way is not the exact kernel for the finite-interval density estimation; the data must be reflected an infinite number of times. The exact kernel obtained through the reflection method is
\[ H(x,u,t) = \frac{1}{\sqrt{2\pi t}} \sum_{m=-\infty}^{\infty} \left[ \exp\left( -\frac{(x - u - 2lm)^2}{2t} \right) + \exp\left( -\frac{(x + u - 2lm)^2}{2t} \right) \right]. \qquad (21) \]
This is obtained when we first reflect a data point $X_i$ to the negative side, $-X_i$, and then replicate both points periodically with period $2l$, that is, $X_i + 2lm$ and $-X_i + 2lm$ for all integers $m = \pm 1, \pm 2, \dots$. This indefinite reflection is required because the reflected data $-X_i$ affect not only the left boundary $x = 0$ but also the right side $x = l$. Another copy must be made at position $X_i + 2l$ to cancel out the second undesired effect. However, this additional copy affects both boundaries $x = 0, l$, and to cancel these effects out appropriately, an infinite number of reflections must be made.
There are two representations, Equations (11) and (21), for the same problem. Proposition 7 confirms that these two equations represent the same kernel function in different ways.
Each representation has its advantages. In Equation (11), the infinite series converges rapidly when the dampening factor $\exp(-m^2\pi^2 t/(2l^2))$ vanishes quickly as $m$ increases, which occurs when either $t$ is large or $l$ is small. Therefore, Equation (11) is suitable for problems in which either the bandwidth is relatively large or the interval domain is narrow. In contrast, Equation (21) converges rapidly when $\exp(-(x - u - 2lm)^2/(2t))$ quickly approaches zero as $|m|$ increases, which happens when $t$ is small or $l$ is large. Therefore, the infinite-reflection representation is preferable when the bandwidth is relatively small or the domain of the density function is long.
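The equivalence asserted in Proposition 7 below is easy to check numerically. A minimal sketch (assuming NumPy; the truncation levels and evaluation points are illustrative) comparing the series representation (11) with the truncated infinite-reflection representation (21):

```python
import numpy as np

def kernel_series(x, u, t, l=1.0, M=200):
    """Eigenfunction representation (11) of the Neumann kernel on [0, l]."""
    m = np.arange(1, M + 1)
    damp = np.exp(-(m * np.pi)**2 * t / (2 * l**2))
    return 1.0 / l + (2.0 / l) * np.sum(damp * np.cos(m * np.pi * u / l)
                                             * np.cos(m * np.pi * x / l))

def kernel_reflect(x, u, t, l=1.0, M=200):
    """Infinite-reflection representation (21), truncated at |m| <= M."""
    m = np.arange(-M, M + 1)
    gauss = lambda z: np.exp(-z**2 / (2 * t)) / np.sqrt(2 * np.pi * t)
    return np.sum(gauss(x - u - 2 * l * m) + gauss(x + u - 2 * l * m))

assert np.isclose(kernel_series(0.3, 0.7, t=0.05), kernel_reflect(0.3, 0.7, t=0.05))
```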
For the Dirichlet condition, when the value of the density function is set to 0 on both sides, we apply the negative reflection principle to derive
\[ H(x,u,t) = \frac{1}{\sqrt{2\pi t}} \sum_{m=-\infty}^{\infty} \left[ \exp\left( -\frac{(x - u - 2lm)^2}{2t} \right) - \exp\left( -\frac{(x + u - 2lm)^2}{2t} \right) \right]. \qquad (22) \]
Proposition 7. 
For $0 < x < l$ and $t > 0$, the kernels in (11) and (21) are equivalent, and hence so are the corresponding kernel density estimates. Likewise, (14) and (22) are equivalent.

6. Main Results: Exact Kernels for Multivariate Data

This section considers multivariate density estimation on a compact support with a boundary condition. One observes data $X_1, \dots, X_n$, where $X_i = (X_{i1}, \dots, X_{ik})$ is a $k$-dimensional vector supported on a bounded domain $\Omega \subset \mathbb{R}^k$. Let $\bar{\Omega}$ represent the closure of $\Omega$ and $\partial\Omega$ denote the boundary of the domain. The Laplacian operator is defined as $\Delta = \frac{\partial^2}{\partial x_1^2} + \cdots + \frac{\partial^2}{\partial x_k^2}$; it is the natural generalization of the second-order derivative. Let $x$ represent a vector $(x_1, \dots, x_k)$.
Suppose that the density function takes value 0 on the boundary. To determine the exact kernel of the density estimation, consider a k-dimensional heat equation with Dirichlet boundary conditions:
\[ \phi_t(x,t) = \tfrac{1}{2} \Delta \phi(x,t), \qquad x \in \Omega, \; t > 0 \]
\[ \phi(x,0) = g(x), \qquad x \in \bar{\Omega} \]
\[ \phi(x,t) = 0, \qquad x \in \partial\Omega, \; t > 0. \]
The strategy that worked for the one-dimensional problem continues to be useful: find the solution of the heat equation and use the empirical density $dF_n(x)$ as the initial condition $g(x)$. The eigenfunction expansion can be used for this purpose [18]. Consider the following time-independent problem:
\[ \Delta \phi(x) + \lambda \phi(x) = 0 \quad \text{in } \Omega, \qquad \phi(x) = 0 \quad \text{on } \partial\Omega. \]
The values of $\lambda$ that admit a nontrivial solution $\phi$ of the above equation are called the eigenvalues of the Laplacian $\Delta$, and the corresponding $\phi$’s are eigenfunctions. There is a countable set of eigenvalues $\{\lambda_m\}_{m=1}^{\infty}$ with positive values, and for each $\lambda_m$, there is a finite number of linearly independent eigenfunctions. Moreover, the eigenfunctions corresponding to distinct eigenvalues are orthogonal. Let $(\lambda_m, \varphi_m)$ denote the eigenvalues and eigenfunctions of the Laplacian in a bounded domain $\Omega$ with Dirichlet boundary conditions.
Suppose that the initial condition $g(x)$ can be expanded in the above eigenfunctions:
\[ g(x) = \sum_{m=1}^{\infty} A_m \varphi_m(x), \]
where the Fourier coefficients are $A_m = \int_{\Omega} g(u) \varphi_m(u)\, du$. When the initial condition is the data, i.e., a sum of Dirac functions, the expansion should be understood in the sense that the infinite sum converges to $g(x)$ in $L^2$: $\int_{\Omega} \{ g(x) - \sum_{m=1}^{M} A_m \varphi_m(x) \}^2\, dx \to 0$ as $M \to \infty$. The solution to the heat equation is
\[ \phi(x,t) = \sum_{m=1}^{\infty} A_m e^{-\lambda_m t/2} \varphi_m(x). \]
Plugging in the coefficients $A_m$ and interchanging the integration and summation, we obtain
\[ \phi(x,t) = \sum_{m=1}^{\infty} \left[ \int_{\Omega} g(u) \varphi_m(u)\, du \right] e^{-\lambda_m t/2} \varphi_m(x) = \int_{\Omega} \left[ \sum_{m=1}^{\infty} e^{-\lambda_m t/2} \varphi_m(x) \varphi_m(u) \right] g(u)\, du. \]
This means that $\phi(x,t)$ can be written as a convolution of a kernel and the initial condition,
\[ \phi(x,t) = \int_{\Omega} H(x,u,t)\, g(u)\, du, \]
where the kernel is defined as follows:
\[ H(x,u,t) = \sum_{m=1}^{\infty} e^{-\lambda_m t/2} \varphi_m(x) \varphi_m(u). \]
Plugging the empirical density into the initial condition yields the kernel density estimator
\[ \hat{f}(x,t) = \int_{\Omega} H(x,u,t) \frac{1}{n} \sum_{i=1}^{n} \delta(u - X_i)\, du = \frac{1}{n} \sum_{i=1}^{n} \sum_{m=1}^{\infty} e^{-\lambda_m t/2} \varphi_m(x) \varphi_m(X_i). \qquad (24) \]
The specific form of the density estimator depends on several factors, including the dimension of the problem, the shape of the boundary, and the specific boundary condition. We consider examples where a multivariate density function is defined within a box.
Suppose that a multivariate density function is defined on $\Omega = (0, a_1) \times \cdots \times (0, a_k)$, and its value on the boundary is 0. Consider the following eigenvalue problem:
\[ \phi_{x_1 x_1} + \cdots + \phi_{x_k x_k} + \lambda \phi = 0 \quad \text{in } \Omega, \qquad (25) \]
\[ \phi(0, x_2, \dots, x_k) = \phi(a_1, x_2, \dots, x_k) = 0, \;\; \dots, \;\; \phi(x_1, \dots, x_{k-1}, 0) = \phi(x_1, \dots, x_{k-1}, a_k) = 0. \]
Theorem 1 identifies its eigenvalues and eigenfunctions. Using the procedure described above, one can obtain the desired multivariate kernel function.
Theorem 1. 
The eigenvalues and eigenfunctions of Equation (25) are
\[ \lambda_{m_1,\dots,m_k} = \pi^2 \sum_{j=1}^{k} \frac{m_j^2}{a_j^2} \quad \text{and} \quad \varphi_{m_1,\dots,m_k}(x) = \prod_{j=1}^{k} \sin\frac{m_j \pi x_j}{a_j}, \]
where $m_1, \dots, m_k = 1, 2, 3, \dots$. The corresponding kernel is
\[ H(x,u,t) = \frac{2^k}{\prod_{j=1}^{k} a_j} \sum_{m_1,\dots,m_k = 1}^{\infty} e^{-\pi^2 t \sum_{j=1}^{k} m_j^2 / (2 a_j^2)} \prod_{j=1}^{k} \sin\frac{m_j \pi x_j}{a_j} \sin\frac{m_j \pi u_j}{a_j}. \]
Moreover, from (24), the kernel density estimator for a given bandwidth $t$ is
\[ \hat{f}(x,t) = \int_{\Omega} H(x,u,t) \frac{1}{n} \sum_{i=1}^{n} \prod_{j=1}^{k} \delta(u_j - X_{ij})\, du = \frac{1}{n} \sum_{i=1}^{n} \frac{2^k}{\prod_{j=1}^{k} a_j} \sum_{m_1,\dots,m_k = 1}^{\infty} e^{-\pi^2 t \sum_{j=1}^{k} m_j^2 / (2 a_j^2)} \prod_{j=1}^{k} \sin\frac{m_j \pi x_j}{a_j} \sin\frac{m_j \pi X_{ij}}{a_j}. \]
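Because the damping factor and the eigenfunctions both factor over coordinates, the multivariate series is a product of univariate sine series, which keeps the computation cheap. A minimal sketch for $k = 2$ (assuming NumPy; the Beta(2,2) sample is illustrative):

```python
import numpy as np

def kde_box_dirichlet(x, X, t, a=(1.0, 1.0), M=50):
    """Theorem 1's estimator at a point x = (x_1, ..., x_k) for data X of shape
    (n, k) on the box prod_j (0, a_j), using the product form of the kernel."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    m = np.arange(1, M + 1)
    per_obs = np.ones(n)
    for j in range(k):
        damp = np.exp(-(m * np.pi / a[j])**2 * t / 2)    # e^{-pi^2 t m^2/(2 a_j^2)}
        series = (np.sin(np.outer(X[:, j], m) * np.pi / a[j])
                  * np.sin(m * np.pi * x[j] / a[j]) * damp).sum(axis=1)
        per_obs *= (2.0 / a[j]) * series                 # collects 2^k / prod_j a_j
    return per_obs.mean()

X = np.random.default_rng(4).beta(2, 2, size=(500, 2))   # vanishes on the box boundary
print(kde_box_dirichlet(np.array([0.5, 0.5]), X, t=0.1**2))
```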
The next example is a case where the partial derivatives of the density function are known on the boundary: the first partial derivatives of the density function are identically 0 across the boundary. To determine the eigenvalues and corresponding eigenfunctions of the Neumann problem, consider the following time-independent problem:
\[ \phi_{x_1 x_1} + \cdots + \phi_{x_k x_k} + \lambda \phi = 0 \quad \text{in } \Omega, \qquad (26) \]
\[ \phi_{x_1}(0, x_2, \dots, x_k) = \phi_{x_1}(a_1, x_2, \dots, x_k) = 0, \;\; \dots, \;\; \phi_{x_k}(x_1, \dots, x_{k-1}, 0) = \phi_{x_k}(x_1, \dots, x_{k-1}, a_k) = 0. \]
Theorem 2. 
The eigenvalues and eigenfunctions of Equation (26) are
\[ \lambda_{m_1,\dots,m_k} = \pi^2 \sum_{j=1}^{k} \frac{m_j^2}{a_j^2} \quad \text{and} \quad \varphi_{m_1,\dots,m_k}(x) = \prod_{j=1}^{k} \cos\frac{m_j \pi x_j}{a_j}, \]
where $m_1, \dots, m_k = 0, 1, 2, \dots$. The kernel function is
\[ H(x,u,t) = \sum_{m_1,\dots,m_k = 0}^{\infty} \bar{B}_{m_1,\dots,m_k}^{-1}\, e^{-\pi^2 t \sum_{j=1}^{k} m_j^2 / (2 a_j^2)} \prod_{j=1}^{k} \cos\frac{m_j \pi x_j}{a_j} \cos\frac{m_j \pi u_j}{a_j}, \]
where the constants $\bar{B}_{m_1,\dots,m_k}$ can be found in (A5). Moreover, the multivariate density estimator is
\[ \hat{f}(x,t) = \int_{\Omega} H(x,u,t) \frac{1}{n} \sum_{i=1}^{n} \prod_{j=1}^{k} \delta(u_j - X_{ij})\, du = \frac{1}{n} \sum_{i=1}^{n} \sum_{m_1,\dots,m_k = 0}^{\infty} \bar{B}_{m_1,\dots,m_k}^{-1}\, e^{-\pi^2 t \sum_{j=1}^{k} m_j^2 / (2 a_j^2)} \prod_{j=1}^{k} \cos\frac{m_j \pi x_j}{a_j} \cos\frac{m_j \pi X_{ij}}{a_j}. \]
Note that the solution of the multivariate density estimation is the product of solutions of the corresponding univariate problems. This is because a rectangular domain ensures independence with respect to variables x 1 , , x k . A circular domain also produces a simple solution owing to its angular symmetry. If the shape of Ω is more challenging, the same procedure can still be used, but the solution of the heat equation may be found numerically.

7. Simulation Results

This section assesses the finite sample performance of the proposed exact kernel density estimator using simulated data. We generate samples from two models. In Model 1, we independently draw a sample of size n from a standard uniform distribution. In Model 2, the sample is drawn from a truncated normal distribution with a mean of zero, a variance of one, and lower and upper truncation points of 0 and 1, respectively. In both cases, the support of the data is the unit interval [ 0 , 1 ] . We evaluate the performance of several density estimators at 0, which is the left end of the support. The same analysis can be conducted for 1, the right end of the support. The results will be symmetric.
Consider four estimators. (i) $\hat{f}_n(\cdot)$ denotes the ordinary kernel density estimator in (1) with no boundary bias correction. (ii) $\hat{f}_r(\cdot)$ is the density estimator from the one-sided reflection method. As explained in Section 5, it is the kernel density estimator using the augmented sample $(X_1, \dots, X_n, -X_1, \dots, -X_n)$. (iii) $\hat{f}_{dr}(\cdot)$ is the estimator from the two-sided reflection method. In addition to the sample used in the one-sided reflection, this two-sided method uses the additional data points $(2 - X_1, \dots, 2 - X_n)$. (iv) $\hat{f}_e(\cdot)$ is the exact kernel density estimator obtained in Proposition 2. This estimator imposes the condition that the slopes of $f(\cdot)$ at the boundary points are equal to zero. As explained in Section 5, the reflection methods impose the same constraints at the boundary.
We try three different values for the sample size $n$ and five different values for the bandwidth $h$, as shown in Table 1. Recall that the smoothing parameter $t$ for the exact kernel method is related to $h$ by $t = h^2$. Let $\hat{f}^{(j)}(0)$ represent the density estimate at 0 for the $j$-th simulation run, and let $L$ be the total number of simulation repetitions. Then the bias and root mean squared errors are calculated as
\[ \mathrm{Bias} = L^{-1} \sum_{j=1}^{L} \left[ \hat{f}^{(j)}(0) - f(0) \right], \qquad \mathrm{RMSE} = \sqrt{ L^{-1} \sum_{j=1}^{L} \left[ \hat{f}^{(j)}(0) - f(0) \right]^2 }. \]
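A minimal sketch of the Monte Carlo loop (assuming NumPy and reusing the kde_neumann sketch from Section 3; Model 1 draws from the standard uniform, for which $f(0) = 1$):

```python
import numpy as np

def simulate_at_zero(estimator, f0=1.0, n=100, L=1000, seed=0):
    """Bias and RMSE of `estimator` (a function mapping a sample to f-hat(0))
    over L Monte Carlo repetitions under Model 1 (standard uniform)."""
    rng = np.random.default_rng(seed)
    est = np.array([estimator(rng.uniform(size=n)) for _ in range(L)])
    bias = est.mean() - f0
    rmse = np.sqrt(((est - f0)**2).mean())
    return bias, rmse

# exact kernel estimator evaluated at x = 0 with bandwidth h = 0.3 (t = h^2)
bias, rmse = simulate_at_zero(lambda X: kde_neumann(np.array([0.0]), X, t=0.3**2)[0])
```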
Table 1 shows the results for Model 1. Overall, the ordinary kernel density estimator f ^ n proves to be inadequate. It should not be used in empirical studies when the support of the density function is bounded. The one-sided reflection method f ^ r significantly mitigates the boundary bias problem, and the two-sided reflection method f ^ d r performs even better than its one-sided counterpart. However, they tend to exhibit larger biases when the bandwidth value is large. As explained in Section 5, reflecting the sample only once over the boundary is not sufficient to correct the boundary bias problem, and this tendency becomes more pronounced with larger bandwidth values. Regarding the bias–variance trade-off, note that a smaller bandwidth results in smaller bias but larger variance, while a larger bandwidth has the opposite effect. Finally, the exact kernel density estimator f ^ e proves to be the most effective in controlling the boundary bias problem. It remains effective even in small samples and with larger bandwidth values.
Table 2 shows comparable results for Model 2. We observe the same overall tendencies as before. However, unlike in Model 1, the biases in Model 2 do not disappear even as the sample size increases. This may be attributed to the fact that the zero-slope conditions at the boundary may not be satisfied by the truncated normal density. Since this constraint is imposed in all three boundary correction methods considered in this section, it could result in a persistent bias. Nonetheless, even in this case, it is evident that the exact kernel density estimator is the most effective approach for addressing the boundary bias issue.

8. Real Data Example

The propensity score is the probability of an individual receiving treatment given their characteristics. One crucial issue is whether the propensity score model is well-specified. Because the estimated propensity score is a critical step in controlling selection bias in non-experimental data, a misspecified propensity score may lead to incorrect inference of the treatment effect. One strategy in the literature for checking the model specification is to compare the estimated densities of the propensity scores between the treatment and control groups [19]. This demonstrates the importance of recognizing boundary effects in the density estimation step.
We illustrate this point using the National Supported Work (NSW) demonstration experiment. We used a sample from [20]. Since male participants in the NSW data are a focal point in the literature, we considered only male participants. For the non-experimental comparison group, we used [20]’s full PSID sample. There were 297 treatment observations and 2490 control observations. Most studies use parametric models, such as logit or probit, to estimate the propensity score; therefore, we fit the probability of being in the treatment group using a logit model. (Data are available in the R package rrp. The individual characteristics used to fit the logit model are the linear terms of all explanatory variables (except post-experiment income) in the dataset LLvsPSID in rrp.) Figure 2 shows histograms of the estimated propensity scores of the treatment and control groups. There were many instances where the estimated probabilities were close to the boundaries. In the control group, most individuals’ propensity scores were close to 0 because there were few similarities between the control and treatment groups. The treatment group comprises people eligible for the government-supported social program; therefore, they often represent the low-income part of the population. In contrast, the PSID control group represents the average household in the United States; therefore, its characteristics are quite different from those of the treatment group.
We used the estimated propensity score of each group as our data and applied the kernel method to estimate density functions to compare the density functions of the propensity score. Figure 3 shows the density estimated using the Gaussian kernel and by applying [3]’s rule of thumb to select the optimal bandwidth for each sample.
The propensity score is a probability, so its density estimate should be supported within the unit interval. However, the density estimates flow outside the unit interval because many individuals have propensity scores close to the boundaries of either 0 or 1. Because the kernel method is a smoothing technique, it allows the information carried by the data points to flow outside, and a significant mass is generated outside the unit interval when many observations are near the boundaries. This is a conceptual problem because we compare two probability density functions. It also causes a boundary bias problem because it underestimates the density value near the boundary.
As a remedy, we use the exact kernel on a finite domain from Proposition 2 with $l = 1$:
\[ \hat{f}(x) = 1 + \frac{2}{n} \sum_{i=1}^{n} \sum_{m=1}^{\infty} \exp\left( -\frac{m^2\pi^2 t}{2} \right) \cos(m\pi X_i) \cos(m\pi x). \]
We applied this to the propensity score for each group and obtained Figure 4. Unlike Figure 3, the results in Figure 4 show substantial masses clustered at the boundaries. For the control group, where most observations are close to 0, the estimated value of the density function on the left side is approximately twice as large in Figure 4 as in Figure 3. This indicates that inference would be incorrect without a proper mechanism for fixing the boundary effects.
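As a sketch of this final step (assuming NumPy and reusing the kde_neumann function from Section 3 with $l = 1$; the Beta samples below are hypothetical stand-ins for the fitted logit propensity scores, and the bandwidth is illustrative):

```python
import numpy as np

# hypothetical stand-ins for the estimated propensity scores of each group
pscore_treat = np.random.default_rng(5).beta(2.0, 3.0, size=297)
pscore_control = np.random.default_rng(6).beta(0.5, 8.0, size=2490)

grid = np.linspace(0.0, 1.0, 501)
f_treat = kde_neumann(grid, pscore_treat, t=0.06**2, l=1.0)
f_control = kde_neumann(grid, pscore_control, t=0.06**2, l=1.0)
# all mass stays inside [0, 1]; no information leaks across the boundaries
```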

9. Conclusions

In this study, we developed a method for obtaining exact boundary kernels that satisfy the boundary conditions of the density function of interest. The method was extended to multivariate density estimation on compact supports. A naive density estimation method may induce a significant boundary bias; the proposed method corrects this problem by providing a density estimate properly supported on the finite domain. We expect this new approach to be helpful in classification, clustering, and discriminant analysis, in which nonparametric density estimation plays a central role.

Funding

This research received no external funding.

Data Availability Statement

Not applicable.

Acknowledgments

The author would like to thank the anonymous reviewers who helped improve the paper’s readability and quality.

Conflicts of Interest

The author declares no conflict of interest.

Appendix A. Proofs of Main Results

Appendix A.1. Proof of Proposition 1

Proof. 
We use the Laplace transform of $\phi(x,t)$ with respect to time $t$: $\Phi(x,s) = \int_0^{\infty} e^{-st} \phi(x,t)\, dt$. Taking the Laplace transform on both sides of Equation (3),
\[ \int_0^{\infty} e^{-st} \phi_t\, dt = \int_0^{\infty} \tfrac{1}{2} e^{-st} \phi_{xx}\, dt \]
\[ e^{-st} \phi \Big|_0^{\infty} + s \int_0^{\infty} e^{-st} \phi\, dt = \tfrac{1}{2} \int_0^{\infty} e^{-st} \phi_{xx}\, dt \]
\[ -\phi(x,0) + s\, \Phi(x,s) = \tfrac{1}{2} \Phi_{xx}(x,s). \]
The problem is thus converted into a simple second-order ordinary differential equation, whose solution is
\[ \Phi(x,s) = \frac{1}{\sqrt{2s}} \int_{-\infty}^{\infty} g(u) \exp\left( -\sqrt{2s}\, |x - u| \right) du. \]
An inverse Laplace transform is then applied. The inverse Laplace transform of $\frac{1}{\sqrt{s}} e^{-k\sqrt{s}}$ is $\frac{1}{\sqrt{\pi t}} e^{-k^2/(4t)}$, with which we obtain
\[ \phi(x,t) = \frac{1}{\sqrt{2\pi t}} \int_{-\infty}^{\infty} g(u) \exp\left( -\frac{(x-u)^2}{2t} \right) du. \]

Appendix A.2. Proof of Proposition 2

Proof. 
By separating the variables, we search for solutions of the form $\phi(x,t) = X(x) T(t)$. Substituting $X(x)T(t)$ into the heat equation, we obtain the following two ordinary differential equations:
\[ T'(t) + \frac{\lambda^2}{2} T(t) = 0, \qquad X''(x) + \lambda^2 X(x) = 0, \]
where $T(t) = C e^{-\lambda^2 t/2}$, $X(x) = B \sin(\lambda x) + A \cos(\lambda x)$, and $A, B, C$ are arbitrary constants. The boundary conditions $\phi_x(0,t) = T(t) X'(0) = \phi_x(l,t) = T(t) X'(l) = 0$ imply that $X'(0) = X'(l) = 0$, and the possible eigenvalues are
\[ \lambda = \frac{m\pi}{l}, \qquad m = 0, 1, 2, \dots \]
The corresponding eigenfunctions are
\[ X(x) = A \cos\frac{m\pi x}{l}, \qquad m = 0, 1, 2, \dots \]
This means that the solution has the following form:
\[ \phi(x,t) = A_0 + \sum_{m=1}^{\infty} A_m \cos\frac{m\pi x}{l} \exp\left( -\frac{m^2\pi^2 t}{2l^2} \right). \qquad (A1) \]
We determine the coefficients $A_m$, $m = 0, 1, 2, \dots$, such that the initial condition is fulfilled by the infinite series
\[ g(x) = \phi(x,0) = A_0 + \sum_{m=1}^{\infty} A_m \cos\frac{m\pi x}{l}. \]
The formulation above implies that the $A_m$, $m = 0, 1, 2, \dots$, are the Fourier coefficients
\[ A_0 = \frac{1}{l} \int_0^l g(u)\, du, \qquad A_m = \frac{2}{l} \int_0^l g(u) \cos\frac{m\pi u}{l}\, du. \]
Let the initial condition be given by the data, $g(u) = \frac{1}{n} \sum_{i=1}^{n} \delta(u - X_i)$. We have the Fourier coefficients
\[ A_0 = \frac{1}{l} \cdot \frac{1}{n} \sum_{i=1}^{n} \int_0^l \delta(u - X_i)\, du = \frac{1}{l} \cdot \frac{1}{n} \cdot n = \frac{1}{l}, \qquad A_m = \frac{2}{l} \int_0^l \frac{1}{n} \sum_{i=1}^{n} \delta(u - X_i) \cos\frac{m\pi u}{l}\, du = \frac{2}{ln} \sum_{i=1}^{n} \cos\frac{m\pi X_i}{l}. \]
We plug the coefficients into the solution (A1) and obtain
\[ \phi(x,t) = \frac{1}{l} + \sum_{m=1}^{\infty} \frac{2}{ln} \sum_{i=1}^{n} \cos\frac{m\pi X_i}{l} \cos\frac{m\pi x}{l} \exp\left( -\frac{m^2\pi^2 t}{2l^2} \right). \]
Changing the order of summations,
\[ \phi(x,t) = \frac{1}{l} + \frac{2}{ln} \sum_{i=1}^{n} \sum_{m=1}^{\infty} \cos\frac{m\pi X_i}{l} \cos\frac{m\pi x}{l} \exp\left( -\frac{m^2\pi^2 t}{2l^2} \right). \]
This is the kernel density estimate $\hat{f}(x,t)$. The solution can also be represented as
\[ \phi(x,t) = \int_0^l H(x,u,t)\, dF_n(u), \]
where the kernel function is
\[ H(x,u,t) = \frac{1}{l} + \frac{2}{l} \sum_{m=1}^{\infty} \cos\frac{m\pi u}{l} \cos\frac{m\pi x}{l} \exp\left( -\frac{m^2\pi^2 t}{2l^2} \right). \]

Appendix A.3. Proof of Proposition 3

Proof. 
Using the separation of variables $\phi(x,t) = X(x)T(t)$, substituting it into the heat equation, and using the boundary conditions $X(0) = X(l) = 0$, the eigenvalues are
\[ \lambda = \left( \frac{m\pi}{l} \right)^2, \qquad m = 1, 2, 3, \dots \]
The corresponding eigenfunctions are
\[ X(x) = B \sin\frac{m\pi x}{l}, \qquad m = 1, 2, 3, \dots \]
The solution is given by
\[ \phi(x,t) = \sum_{m=1}^{\infty} B_m \sin\frac{m\pi x}{l} \exp\left( -\frac{m^2\pi^2 t}{2l^2} \right). \qquad (A3) \]
The initial condition implies that
\[ g(x) = \phi(x,0) = \sum_{m=1}^{\infty} B_m \sin\frac{m\pi x}{l}. \]
Therefore, the coefficients $B_m$, $m = 1, 2, 3, \dots$, should be the Fourier coefficients for the above problem:
\[ B_m = \frac{2}{l} \int_0^l g(u) \sin\frac{m\pi u}{l}\, du. \]
We examine the solution to the problem by substituting the coefficients $B_m$ into (A3):
\[ \phi(x,t) = \sum_{m=1}^{\infty} \left[ \frac{2}{l} \int_0^l g(u) \sin\frac{m\pi u}{l}\, du \right] \sin\frac{m\pi x}{l} \exp\left( -\frac{m^2\pi^2 t}{2l^2} \right). \qquad (A4) \]
By interchanging the summation and integration, we obtain
\[ \phi(x,t) = \frac{2}{l} \int_0^l g(u) \sum_{m=1}^{\infty} \sin\frac{m\pi u}{l} \sin\frac{m\pi x}{l} \exp\left( -\frac{m^2\pi^2 t}{2l^2} \right) du. \]
Therefore, we can write Equation (A4) as
\[ \phi(x,t) = \int_0^l H(x,u,t)\, g(u)\, du, \]
where the kernel $H(x,u,t)$ is given by
\[ H(x,u,t) = \frac{2}{l} \sum_{m=1}^{\infty} \sin\frac{m\pi u}{l} \sin\frac{m\pi x}{l} \exp\left( -\frac{m^2\pi^2 t}{2l^2} \right). \]
With the empirical measure as the initial condition,
\[ \phi(x,t) = \int_0^l H(x,u,t)\, dF_n(u) = \frac{2}{l} \int_0^l \frac{1}{n} \sum_{i=1}^{n} \delta(u - X_i) \sum_{m=1}^{\infty} \sin\frac{m\pi u}{l} \sin\frac{m\pi x}{l} \exp\left( -\frac{m^2\pi^2 t}{2l^2} \right) du = \frac{2}{nl} \sum_{i=1}^{n} \sum_{m=1}^{\infty} \sin\frac{m\pi X_i}{l} \sin\frac{m\pi x}{l} \exp\left( -\frac{m^2\pi^2 t}{2l^2} \right). \]

Appendix A.4. Proof of Proposition 4

Proof. 
The solution is divided into two parts: steady-state and transient (see [21] for further discussion). Let $\psi(x,t)$ be the transient solution; then, the solution of the original problem can be decomposed as
\[ \phi(x,t) = \alpha + (\beta - \alpha)\frac{x}{l} + \psi(x,t), \]
where $\psi(x,t)$ satisfies the diffusion equation under simple boundary conditions,
\[ \psi_t(x,t) = \tfrac{1}{2} \psi_{xx}(x,t), \qquad 0 < x < l, \; t > 0, \qquad \psi(0,t) = 0, \quad \psi(l,t) = 0, \qquad t > 0, \]
but with the adjusted initial condition
\[ \psi(x,0) = g(x) - \left[ \alpha + (\beta - \alpha)\frac{x}{l} \right], \qquad 0 < x < l. \]
The solution to this problem is presented in Proposition 3. We apply the heat kernel in (14) with the modified initial condition and plug $dF_n(x)$ into $g(x)$:
\[ \psi(x,t) = \frac{2}{l} \sum_{m=1}^{\infty} \left[ \int_0^l \left( \frac{1}{n} \sum_{i=1}^{n} \delta(u - X_i) - \alpha - \frac{\beta - \alpha}{l} u \right) \sin\frac{m\pi u}{l}\, du \right] \sin\frac{m\pi x}{l}\, e^{-\frac{1}{2}\left(\frac{m\pi}{l}\right)^2 t} = \frac{2}{l} \sum_{m=1}^{\infty} \left[ \frac{1}{n} \sum_{i=1}^{n} \sin\frac{m\pi X_i}{l} - \int_0^l \left( \alpha + \frac{\beta - \alpha}{l} u \right) \sin\frac{m\pi u}{l}\, du \right] \sin\frac{m\pi x}{l}\, e^{-\frac{1}{2}\left(\frac{m\pi}{l}\right)^2 t}. \]
The solution is recovered as $\phi(x,t) = [\alpha + (\beta - \alpha)\frac{x}{l}] + \psi(x,t)$. To simplify the solution, we calculate
\[ \alpha \int_0^l \sin\frac{m\pi u}{l}\, du = \frac{\alpha l}{m\pi} \int_0^{m\pi} \sin(v)\, dv = \frac{\alpha l}{m\pi} \left( 1 - \cos(m\pi) \right), \]
\[ \frac{\beta - \alpha}{l} \int_0^l u \sin\frac{m\pi u}{l}\, du = \frac{(\beta - \alpha) l}{(m\pi)^2} \int_0^{m\pi} v \sin(v)\, dv = \frac{(\beta - \alpha) l}{(m\pi)^2} \sin(m\pi) - \frac{(\beta - \alpha) l}{m\pi} \cos(m\pi). \]
Adding everything to the solution,
\[ \phi(x,t) = \alpha + (\beta - \alpha)\frac{x}{l} + \frac{2}{l} \sum_{m=1}^{\infty} \left[ \frac{1}{n} \sum_{i=1}^{n} \sin\frac{m\pi X_i}{l} - \frac{l}{m\pi} \left( \alpha - \beta \cos(m\pi) + \frac{\beta - \alpha}{m\pi} \sin(m\pi) \right) \right] \sin\frac{m\pi x}{l}\, e^{-\frac{1}{2}\left(\frac{m\pi}{l}\right)^2 t} = \alpha + (\beta - \alpha)\frac{x}{l} + \frac{2}{l} \sum_{m=1}^{\infty} \left[ \frac{1}{n} \sum_{i=1}^{n} \sin\frac{m\pi X_i}{l} - \frac{l}{m\pi} \left( \alpha - \beta(-1)^m \right) \right] \sin\frac{m\pi x}{l}\, e^{-\frac{1}{2}\left(\frac{m\pi}{l}\right)^2 t}. \]

Appendix A.5. Proof of Proposition 5

Proof. 
The proof is similar to that of Proposition 4; therefore, it is omitted to save space. □

Appendix A.6. Proof of Proposition 6

Proof. 
The proof is similar to that of Proposition 4 and is omitted. □

Appendix A.7. Proof of Proposition 7

Proof. 
It is sufficient to show that the corresponding kernel functions are equivalent. The proof presented here is based on [16]. Define
\[ k(x,t) = \frac{1}{\sqrt{2\pi t}}\, e^{-x^2/(2t)}, \qquad \theta(x,t) = \sum_{m=-\infty}^{\infty} k(x + 2ml, t). \]
Ref. [16] (page 90) showed that, as a function of $x$, $\theta(x,t)$ can be represented by the following Fourier series expansion:
\[ \theta(x,t) = \frac{1}{2l} \sum_{m=-\infty}^{\infty} e^{\frac{m\pi}{l} i x - \frac{1}{2}\left(\frac{m\pi}{l}\right)^2 t}. \]
Rearranging the terms,
\[ \theta(x,t) = \frac{1}{2l} + \frac{1}{2l} \sum_{m=1}^{\infty} e^{\frac{m\pi}{l} i x - \frac{1}{2}\left(\frac{m\pi}{l}\right)^2 t} + \frac{1}{2l} \sum_{m=1}^{\infty} e^{-\frac{m\pi}{l} i x - \frac{1}{2}\left(\frac{m\pi}{l}\right)^2 t} = \frac{1}{2l} + \frac{1}{2l} \sum_{m=1}^{\infty} e^{-\frac{1}{2}\left(\frac{m\pi}{l}\right)^2 t} \left( e^{\frac{m\pi}{l} i x} + e^{-\frac{m\pi}{l} i x} \right) = \frac{1}{2l} + \frac{1}{2l} \sum_{m=1}^{\infty} e^{-\frac{1}{2}\left(\frac{m\pi}{l}\right)^2 t} \cdot 2\cos\frac{m\pi x}{l} = \frac{1}{2l} + \frac{1}{l} \sum_{m=1}^{\infty} e^{-\frac{1}{2}\left(\frac{m\pi}{l}\right)^2 t} \cos\frac{m\pi x}{l}. \]
The kernel of the infinite reflection method can be represented as $\theta(x+u,t) + \theta(x-u,t)$, which proves the first part of the Proposition:
\[ \theta(x+u,t) + \theta(x-u,t) = \frac{1}{l} + \frac{1}{l} \sum_{m=1}^{\infty} e^{-\frac{1}{2}\left(\frac{m\pi}{l}\right)^2 t} \left[ \cos\frac{m\pi (x-u)}{l} + \cos\frac{m\pi (x+u)}{l} \right] = \frac{1}{l} + \frac{1}{l} \sum_{m=1}^{\infty} e^{-\frac{1}{2}\left(\frac{m\pi}{l}\right)^2 t} \cdot 2 \cos\frac{m\pi x}{l} \cos\frac{m\pi u}{l} = \frac{1}{l} + \frac{2}{l} \sum_{m=1}^{\infty} e^{-\frac{1}{2}\left(\frac{m\pi}{l}\right)^2 t} \cos\frac{m\pi x}{l} \cos\frac{m\pi u}{l}. \]
To prove the second part of the Proposition, note that $\theta(x-u,t) - \theta(x+u,t)$ is the kernel from the infinite negative reflection, and by the same logic,
\[ \theta(x-u,t) - \theta(x+u,t) = \frac{2}{l} \sum_{m=1}^{\infty} e^{-\frac{1}{2}\left(\frac{m\pi}{l}\right)^2 t} \sin\frac{m\pi x}{l} \sin\frac{m\pi u}{l}. \]

Appendix A.8. Proof of Theorem 1

Proof. 
For ease of exposition, consider the case $k = 2$; the general case can be proved in the same way. Apply the separation of variables: let $\phi(x_1, x_2) = X(x_1) Y(x_2)$, and from (25),
\[ \frac{X''(x_1)}{X(x_1)} + \frac{Y''(x_2)}{Y(x_2)} = -\lambda. \]
The first and second terms on the left-hand side are functions of different variables. Let $\lambda = \mu^2 + \nu^2$ and, applying the boundary conditions, obtain two separate equations:
\[ X''(x_1) + \mu^2 X(x_1) = 0, \quad X(0) = X(a_1) = 0, \qquad Y''(x_2) + \nu^2 Y(x_2) = 0, \quad Y(0) = Y(a_2) = 0. \]
This reduces the two-dimensional problem to two one-dimensional problems with solutions
\[ \mu_{m_1} = \frac{m_1 \pi}{a_1}, \quad X_{m_1}(x_1) = \sin\frac{m_1 \pi x_1}{a_1}, \qquad \nu_{m_2} = \frac{m_2 \pi}{a_2}, \quad Y_{m_2}(x_2) = \sin\frac{m_2 \pi x_2}{a_2}, \]
where $m_1, m_2 = 1, 2, 3, \dots$. This means that the eigenvalues and eigenfunctions are
\[ \lambda_{m_1 m_2} = \left( \frac{m_1 \pi}{a_1} \right)^2 + \left( \frac{m_2 \pi}{a_2} \right)^2, \qquad \varphi_{m_1 m_2} = \sin\frac{m_1 \pi x_1}{a_1} \sin\frac{m_2 \pi x_2}{a_2}. \]
To verify that the set of eigenfunctions $\{\varphi_{m_1 m_2}\}_{m_1, m_2 = 1}^{\infty}$ is orthogonal, calculate
\[ \int_0^{a_2} \int_0^{a_1} \varphi_{m_1 m_2}(x_1, x_2)\, \varphi_{m_1' m_2'}(x_1, x_2)\, dx_1\, dx_2 = \int_0^{a_1} \sin\frac{m_1 \pi x_1}{a_1} \sin\frac{m_1' \pi x_1}{a_1}\, dx_1 \cdot \int_0^{a_2} \sin\frac{m_2 \pi x_2}{a_2} \sin\frac{m_2' \pi x_2}{a_2}\, dx_2 = \begin{cases} 0 & \text{when } m_1 \ne m_1' \text{ or } m_2 \ne m_2' \\ \dfrac{a_1 a_2}{4} & \text{when } m_1 = m_1' \text{ and } m_2 = m_2'. \end{cases} \]
Dividing each $\varphi_{m_1 m_2}$ by $\sqrt{a_1 a_2 / 4}$, one obtains the normalized eigenfunctions. This leads to the desired kernel function $H(x,u,t)$. □

Appendix A.9. Proof of Theorem 2

Proof. 
For ease of exposition, continue with the case $k = 2$; the general case can be proved in the same way. By separation of variables and the boundary conditions, one obtains the following equations:
\[ X''(x_1) + \mu^2 X(x_1) = 0, \quad X'(0) = X'(a_1) = 0, \qquad Y''(x_2) + \nu^2 Y(x_2) = 0, \quad Y'(0) = Y'(a_2) = 0. \]
Their solutions are
\[ \mu_{m_1} = \frac{m_1 \pi}{a_1}, \quad X_{m_1}(x_1) = \cos\frac{m_1 \pi x_1}{a_1}, \qquad \nu_{m_2} = \frac{m_2 \pi}{a_2}, \quad Y_{m_2}(x_2) = \cos\frac{m_2 \pi x_2}{a_2}, \]
where $m_1, m_2 = 0, 1, 2, \dots$; note that the index starts from 0. The corresponding eigenvalues and eigenfunctions are
\[ \lambda_{m_1 m_2} = \left( \frac{m_1 \pi}{a_1} \right)^2 + \left( \frac{m_2 \pi}{a_2} \right)^2, \qquad \varphi_{m_1 m_2} = \cos\frac{m_1 \pi x_1}{a_1} \cos\frac{m_2 \pi x_2}{a_2}. \]
To check the orthogonality condition and obtain the normalization constants, calculate
\[ \int_0^{a_2} \int_0^{a_1} \varphi_{m_1 m_2}(x_1, x_2)\, \varphi_{m_1' m_2'}(x_1, x_2)\, dx_1\, dx_2 = \int_0^{a_1} \cos\frac{m_1 \pi x_1}{a_1} \cos\frac{m_1' \pi x_1}{a_1}\, dx_1 \cdot \int_0^{a_2} \cos\frac{m_2 \pi x_2}{a_2} \cos\frac{m_2' \pi x_2}{a_2}\, dx_2 = \begin{cases} 0 & \text{when } m_1 \ne m_1' \text{ or } m_2 \ne m_2' \\ \bar{B}_{m_1 m_2} & \text{when } m_1 = m_1' \text{ and } m_2 = m_2'. \end{cases} \]
To calculate the constants, consider four possible cases, corresponding to the two possibilities $m_1 = 0$ or $m_1 \ge 1$ and $m_2 = 0$ or $m_2 \ge 1$:
\[ \bar{B}_{m_1 m_2} = \begin{cases} a_1 a_2 & \text{when } m_1 = 0 \text{ and } m_2 = 0 \\ a_1 a_2 / 2 & \text{when } m_1 = 0 \text{ and } m_2 \ge 1 \\ a_1 a_2 / 2 & \text{when } m_1 \ge 1 \text{ and } m_2 = 0 \\ a_1 a_2 / 4 & \text{when } m_1 \ge 1 \text{ and } m_2 \ge 1. \end{cases} \qquad (A5) \]
Using $\bar{B}_{m_1 m_2}$, i.e., dividing each $\varphi_{m_1 m_2}$ by $\sqrt{\bar{B}_{m_1 m_2}}$, we can normalize the eigenfunctions. This leads to the desired kernel function $H(x,u,t)$. □

References

1. Ai, C. A semiparametric maximum likelihood estimator. Econometrica 1997, 65, 933–963.
2. Klein, R.W.; Spady, R.H. An efficient semiparametric estimator for binary response models. Econometrica 1993, 61, 387–421.
3. Silverman, B.W. Density Estimation for Statistics and Data Analysis; CRC Press: London, UK, 1986.
4. Cowling, A.; Hall, P. On pseudodata methods for removing boundary effects in kernel density estimation. J. Roy. Stat. Soc. B 1996, 58, 551–563.
5. Gasser, T.; Müller, H.G.; Mammitzsch, V. Kernels for nonparametric curve estimation. J. Roy. Stat. Soc. B 1985, 47, 238–252.
6. Müller, H.G. Smooth optimum kernel estimators near endpoints. Biometrika 1991, 78, 521–530.
7. Marron, J.S.; Ruppert, D. Transformations to reduce boundary bias in kernel density estimation. J. Roy. Stat. Soc. B 1994, 56, 653–671.
8. Wand, M.P.; Marron, J.S.; Ruppert, D. Transformations in density estimation. J. Am. Stat. Assoc. 1991, 86, 343–353.
9. Botev, Z.; Grotowski, J.; Kroese, D. Kernel density estimation via diffusion. Ann. Stat. 2010, 38, 2916–2957.
10. Koenker, R.; Mizera, I.; Yoon, J. What do kernel density estimators optimize? J. Econom. Methods 2012, 1, 15–22.
11. Asta, D.M. Kernel density estimation on symmetric spaces of non-compact type. J. Multivar. Anal. 2021, 181, 104676.
12. Colbrook, M.J.; Botev, Z.I.; Kuritz, K.; MacNamara, S. Kernel density estimation with linked boundary conditions. Stud. Appl. Math. 2020, 145, 357–396.
13. Carrier, G.F.; Pearson, C.E. Partial Differential Equations: Theory and Technique, 2nd ed.; Academic Press: Boston, MA, USA, 1988.
14. Haberman, R. Elementary Applied Partial Differential Equations: With Fourier Series and Boundary Value Problems, 1st ed.; Prentice-Hall: Englewood Cliffs, NJ, USA, 1983.
15. Strauss, W.A. Partial Differential Equations: An Introduction, 1st ed.; John Wiley & Sons: New York, NY, USA, 1992.
16. Widder, D.V. The Heat Equation, 1st ed.; Academic Press: New York, NY, USA, 1973.
17. Boneva, L.I.; Kendall, D.; Stefanov, I. Spline transformations: Three new diagnostic aids for the statistical data-analyst. J. Roy. Stat. Soc. B 1971, 33, 1–71.
18. McOwen, R.C. Partial Differential Equations: Methods and Applications, 2nd ed.; Pearson Education: Upper Saddle River, NJ, USA, 2003.
19. Heckman, J.J.; Ichimura, H.; Smith, J.A.; Todd, P.E. Characterizing selection bias using experimental data. Econometrica 1998, 66, 1017–1098.
20. LaLonde, R.J. Evaluating the econometric evaluations of training programs with experimental data. Am. Econ. Rev. 1986, 76, 604–620.
21. Farlow, S.J. Partial Differential Equations for Scientists and Engineers; Dover Publications: New York, NY, USA, 1993.
Figure 1. Diffusion of information at a point mass as time passes.
Figure 2. Histograms of the estimated propensity scores. Note that the estimated probabilities are often close to the boundaries; in the control group, almost all individuals’ propensity scores are close to 0.
Figure 3. Kernel density estimates of the propensity scores with no boundary correction. Note that the density estimates flow outside the unit interval, which results in underestimation near the boundary.
Figure 4. Modified kernel density estimates of the propensity scores, using an exact kernel function with a zero-derivative boundary condition; the kernel forces the derivative of the density to zero at both boundaries. Note the significant masses clustered at the boundaries, which fixes the underestimation problem of the conventional kernel method.
Table 1. Bias and RMSE for Model 1.

                    Bias                            RMSE
           f̂_n    f̂_r    f̂_dr   f̂_e       f̂_n    f̂_r    f̂_dr   f̂_e
n = 30
  h = 0.1  0.498  0.003  0.003  0.003      0.535  0.392  0.392  0.392
  h = 0.3  0.500  0.000  0.001  0.001      0.507  0.171  0.171  0.171
  h = 0.5  0.523  0.045  0.023  0.000      0.524  0.096  0.083  0.075
  h = 0.7  0.577  0.153  0.079  0.000      0.577  0.159  0.085  0.023
  h = 1.0  0.659  0.317  0.181  0.000      0.659  0.318  0.182  0.002
n = 50
  h = 0.1  0.502  0.005  0.005  0.005      0.525  0.308  0.308  0.308
  h = 0.3  0.501  0.003  0.002  0.002      0.506  0.134  0.134  0.134
  h = 0.5  0.523  0.047  0.024  0.001      0.524  0.081  0.067  0.059
  h = 0.7  0.577  0.154  0.079  0.000      0.577  0.157  0.083  0.018
  h = 1.0  0.659  0.318  0.182  0.000      0.659  0.318  0.182  0.001
n = 100
  h = 0.1  0.500  0.000  0.000  0.000      0.512  0.215  0.215  0.215
  h = 0.3  0.500  0.000  0.000  0.001      0.502  0.094  0.094  0.094
  h = 0.5  0.523  0.045  0.022  0.000      0.523  0.064  0.049  0.041
  h = 0.7  0.576  0.153  0.079  0.000      0.577  0.155  0.080  0.013
  h = 1.0  0.659  0.317  0.181  0.000      0.659  0.317  0.181  0.001

Biases (Bias) and root mean squared errors (RMSEs) for the density estimators evaluated at 0, the left end point of the support. Definitions of the four estimators, f̂_n, f̂_r, f̂_dr, f̂_e, can be found in the main text. The number of simulation repetitions is 10,000. The sample sizes n and bandwidths h used in the simulation exercises are shown in the table.
Table 2. Bias and RMSE for Model 2.

                    Bias                            RMSE
           f̂_n    f̂_r    f̂_dr   f̂_e       f̂_n    f̂_r    f̂_dr   f̂_e
n = 30
  h = 0.1  0.588  0.007  0.007  0.007      0.623  0.414  0.414  0.414
  h = 0.3  0.608  0.048  0.048  0.047      0.614  0.179  0.179  0.179
  h = 0.5  0.659  0.149  0.130  0.111      0.660  0.170  0.151  0.133
  h = 0.7  0.729  0.288  0.222  0.151      0.729  0.291  0.224  0.153
  h = 1.0  0.820  0.472  0.344  0.167      0.820  0.472  0.344  0.167
n = 50
  h = 0.1  0.587  0.005  0.005  0.005      0.609  0.322  0.322  0.322
  h = 0.3  0.610  0.051  0.050  0.050      0.613  0.144  0.144  0.143
  h = 0.5  0.660  0.150  0.131  0.112      0.660  0.163  0.145  0.126
  h = 0.7  0.729  0.289  0.223  0.151      0.729  0.291  0.224  0.152
  h = 1.0  0.821  0.472  0.344  0.167      0.821  0.473  0.344  0.167
n = 100
  h = 0.1  0.588  0.007  0.007  0.007      0.599  0.228  0.228  0.228
  h = 0.3  0.610  0.050  0.050  0.050      0.611  0.107  0.107  0.107
  h = 0.5  0.659  0.150  0.131  0.112      0.660  0.157  0.138  0.119
  h = 0.7  0.729  0.289  0.223  0.151      0.729  0.290  0.223  0.152
  h = 1.0  0.821  0.472  0.344  0.167      0.821  0.472  0.344  0.167

Biases (Bias) and root mean squared errors (RMSEs) for the density estimators evaluated at 0, the left end point of the support. See Table 1 for further details.