Article

Robust Inference after Random Projections via Hellinger Distance for Location-Scale Family

1 Department of Statistics, George Mason University, Fairfax, VA 22030, USA
2 Department of Mathematics and Statistics, Brock University, St. Catharines, ON L2S 3A1, Canada
* Author to whom correspondence should be addressed.
Entropy 2019, 21(4), 348; https://doi.org/10.3390/e21040348
Submission received: 6 February 2019 / Revised: 23 March 2019 / Accepted: 24 March 2019 / Published: 29 March 2019

Abstract:
Big data and streaming data are encountered in a variety of contemporary applications in business and industry. In such cases, it is common to use random projections to reduce the dimension of the data, yielding compressed data. These data, however, possess various anomalies, such as heterogeneity, outliers, and round-off errors, which are hard to detect due to volume and processing challenges. This paper describes a new robust and efficient methodology, using Hellinger distance, to analyze the compressed data. Using large sample methods and numerical experiments, it is demonstrated that routine use of the robust estimation procedure is feasible. The role of double limits in understanding efficiency and robustness is brought out, which is of independent interest.

1. Introduction

Streaming data are commonly encountered in several business and industrial applications, leading to so-called Big Data. These are commonly characterized using four V’s: velocity, volume, variety, and veracity. Velocity refers to the speed of data processing, while volume refers to the amount of data. Variety refers to the various types of data, while veracity refers to uncertainty and imprecision in the data. It is believed that veracity is due to data inconsistencies, incompleteness, and approximations. Whatever the real cause, it is hard to identify and pre-process data for veracity in a big data setting. The issues are even more complicated when the data are streaming.
A consequence of data veracity is that the statistical assumptions used for analytics tend to be inaccurate. Specifically, considerations such as model misspecification, statistical efficiency, robustness, and uncertainty assessment, which are a standard part of the statistical toolkit, cannot be routinely carried out due to storage limitations. Statistical methods that facilitate simultaneously addressing the twin problems of volume and veracity would enhance the value of big data. While the health care and financial industries would be the prime beneficiaries of this technology, the methods can be routinely applied in a variety of problems that use big data for decision making.
We consider a collection of n (n of the order of at least 10^6) observations, assumed to be independent and identically distributed (i.i.d.), from a probability distribution f(·) belonging to a location-scale family; that is,
f(x; μ, σ) = (1/σ) f( (x − μ)/σ ),  μ ∈ ℝ, σ > 0.
We denote by Θ the parameter space and, without loss of generality, take it to be compact, since otherwise it can be re-parametrized in such a way that the resulting parameter space is compact (see [1]).
The purpose of this paper is to describe a methodology for joint robust and efficient estimation of μ and σ² that takes into account (i) storage issues, (ii) potential model misspecifications, and (iii) the presence of aberrant outliers. These issues, which are more likely to occur when dealing with massive amounts of data, can lead to inaccurate inference and misleading conclusions if not appropriately accounted for in the methodological development. On the other hand, incorporating them into the existing methodology may not be feasible due to the computational burden.
Hellinger distance-based methods have long been used to handle the dual issues of robustness and statistical efficiency. Since the work of [1,2], statistical methods that invoke alternative objective functions, which converge to the objective function under the posited model, have been developed and shown to possess efficiency and robustness. However, their routine use in the context of big data problems is not feasible due to computational complexity and other statistical challenges. Recently, a class of algorithms, referred to as Divide and Conquer, has been developed to address some of these issues in the context of likelihood. These algorithms consist of distributing the data across multiple processors and, in the context of the problem under consideration, estimating the parameters from each processor separately and then combining them to obtain an overall estimate. The algorithm assumes availability of several processors, with substantial processing power, to solve the complex problem at hand. Since robust procedures involve complex iterative computations, which increase the demand for high-speed processors and enhanced memory, routine use of available analytical methods in a big data setting is challenging. Maximum likelihood estimation in the context of location-scale families of distributions has received much attention in the literature ([3,4,5,6,7]). It is well-known that the maximum likelihood estimators (MLE) of location-scale families may not exist unless the defining function f(·) satisfies certain regularity conditions. Hence, it is natural to ask whether other methods of estimation, such as the minimum Hellinger distance estimator (MHDE), can be used under weaker regularity conditions. This manuscript provides a first step towards addressing this question. Random projections and sparse random projections are increasingly used to “compress data” and then use the resulting compressed data for inference. The methodology, primarily developed by computer scientists, is gaining attention in the statistical community and has been investigated in a variety of recent work ([8,9,10,11,12]). In this manuscript, we describe a Hellinger distance-based methodology for robust and efficient estimation after the use of random projections for compressing i.i.d. data belonging to the location-scale family. The proposed method consists of reducing the dimension of the data to ease computation while simultaneously maintaining robustness and efficiency when the posited model is correct. While primarily developed to handle big and streaming data, the approach can also be used to handle privacy issues in a variety of applications [13].
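As a rough illustration of the divide-and-conquer pattern described above (the block estimator, sizes, and function names below are hypothetical and not taken from the paper), each block of data yields its own estimate and the block estimates are then averaged:

import numpy as np

def block_estimate(block):
    # Per-block estimate of (location, scale); here a simple robust choice
    # (median and MAD) stands in for whatever estimator a processor runs.
    loc = np.median(block)
    scale = 1.4826 * np.median(np.abs(block - loc))
    return np.array([loc, scale])

def divide_and_conquer(data, n_blocks):
    # Split the data into blocks, estimate per block, then combine by averaging.
    blocks = np.array_split(data, n_blocks)
    estimates = np.array([block_estimate(b) for b in blocks])
    return estimates.mean(axis=0)

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=10**6)   # large i.i.d. sample
print(divide_and_conquer(x, n_blocks=100))       # approximately (2, 3)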
The rest of the paper is organized as follows: Section 2 provides background on minimum Hellinger distance estimation; Section 3 is concerned with the development of Hellinger distance-based methods for compressed data obtained after using random projections; additionally, it contains the main results and their proofs. Section 4 contains results of the numerical experiments and also describes an algorithm for implementation of the proposed methods. Section 5 contains a real data example from financial analytics. Section 6 is concerned with discussions and extensions. Section 7 contains some concluding remarks.

2. Background on Minimum Hellinger Distance Estimation

Ref. [1] proposed minimum Hellinger distance (MHD) estimation for i.i.d. observations and established that MHD estimators (MHDE) are simultaneously robust and first-order efficient under the true model. Other researchers have investigated related estimators, for example, [14,15,16,17,18,19,20]. These authors establish that when the model is correct, the MHDE is asymptotically equivalent to the maximum likelihood estimator (MLE) in a variety of independent and dependent data settings. For a comprehensive discussion of minimum divergence theory see [21].
We begin by recalling that the Hellinger distance between two probability densities is the L_2 distance between the square roots of the densities. Specifically, let, for p ≥ 1, ||·||_p denote the L_p norm defined by
||h||_p = ( ∫ |h|^p )^{1/p}.
The Hellinger distance between the densities f ( · ) and g ( · ) is given by
H²(f(·), g(·)) = || f^{1/2}(·) − g^{1/2}(·) ||_2².
Let f(·|θ) denote the density of ℝ^d-valued independent and identically distributed random variables X_1, …, X_n, where θ ∈ Θ ⊂ ℝ^p; let g_n(·) be a nonparametric density estimate (typically a kernel density estimator). The Hellinger distance between f(·|θ) and g_n(·) is then
H²( f(·|θ), g_n(·) ) = || f^{1/2}(·|θ) − g_n^{1/2}(·) ||_2².
The MHDE is a mapping T(·) from the set of all densities to ℝ^p defined as follows:
θ_g = T(g) = argmin_{θ ∈ Θ} H²( f(·|θ), g(·) ).
Please note that the above minimization problem is equivalent to maximizing A( f(·|θ), g(·) ) = ∫ f^{1/2}(x|θ) g^{1/2}(x) dx. Hence the MHDE can alternatively be defined as
θ_g = argmax_{θ ∈ Θ} A( f(·|θ), g(·) ).
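As an illustration of the definitions above, the following minimal Python sketch computes an MHDE for a Gaussian location-scale model by maximizing the affinity between the model density and a kernel density estimate. The grid-based numerical integration, kernel, optimizer, and sample size are assumptions made for the example, not choices prescribed in this paper.

import numpy as np
from scipy import stats, optimize

rng = np.random.default_rng(1)
x = rng.normal(loc=1.0, scale=2.0, size=2000)

grid = np.linspace(x.min() - 5.0, x.max() + 5.0, 4000)
dy = grid[1] - grid[0]
g_vals = stats.gaussian_kde(x)(grid)           # nonparametric estimate g_n

def neg_affinity(theta):
    # A(f(.|theta), g_n) = int f^{1/2} g_n^{1/2} dy, approximated by a Riemann sum.
    mu, sigma = theta
    if sigma <= 0:
        return np.inf
    f_vals = stats.norm.pdf(grid, loc=mu, scale=sigma)
    return -np.sum(np.sqrt(f_vals * g_vals)) * dy

# Maximizing the affinity is equivalent to minimizing the Hellinger distance.
res = optimize.minimize(neg_affinity, x0=[np.median(x), x.std()], method="Nelder-Mead")
print(res.x)                                   # close to (1, 2) when the model holds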
To study the robustness of the MHDE, ref. [1] showed that, to assess the robustness of a functional with respect to the gross-error model, it is necessary to examine the α-influence curve rather than the influence curve, except when the influence curve provides a uniform approximation to the α-influence curve. Specifically, the α-influence function IF_α(θ, z) is defined as follows: for θ ∈ Θ, let f_{α,θ,z} = (1 − α) f(·|θ) + α η_z, where η_z denotes the uniform density on the interval (z − ϵ, z + ϵ), ϵ > 0 is small, α ∈ (0, 1), and z ∈ ℝ; the α-influence function is then defined to be
IF_α(θ, z) = [ T(f_{α,θ,z}) − θ ] / α,
where T(f_{α,θ,z}) is the functional evaluated at the model with density f_{α,θ,z}(·). Equation (2) represents a complete description of the behavior of the estimator in the presence of contamination, up to the shape of the contaminating density. If IF_α(θ, z) is a bounded function of z such that lim_{z → ∞} IF_α(θ, z) = 0 for every θ ∈ Θ, then the functional T is robust at f(·|θ) against 100α% contamination by gross errors at an arbitrarily large value z. The influence function can be obtained by letting α → 0. Under standard regularity conditions, the minimum divergence estimators (MDE) are first-order efficient and have the same influence function as the MLE under the model, which is often unbounded. Hence the robustness of these estimators cannot be explained through their influence functions. In contrast, the α-influence functions of the estimators are often bounded, continuous functions of the contaminating point. Finally, this approach often leads to high breakdown points in parametric estimation. Other explanations can also be found in [22,23].
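The α-influence function can also be explored numerically. The sketch below uses hypothetical settings (a N(0,1) model, 5% contamination, a narrow uniform at z): it computes T(f_{α,θ,z}) on a grid and reports (T(f_{α,θ,z}) − θ)/α for increasingly remote z; the boundedness and decay for large z illustrate the robustness just described.

import numpy as np
from scipy import stats, optimize

grid = np.linspace(-40.0, 40.0, 8001)
dy = grid[1] - grid[0]
eps, alpha, theta0 = 0.1, 0.05, np.array([0.0, 1.0])

def contaminated(z):
    # f_{alpha,theta,z} = (1 - alpha) f(.|theta) + alpha * Uniform(z - eps, z + eps)
    f = stats.norm.pdf(grid, loc=theta0[0], scale=theta0[1])
    eta = stats.uniform.pdf(grid, loc=z - eps, scale=2 * eps)
    return (1 - alpha) * f + alpha * eta

def T(dens):
    # MHDE functional: parameter value maximizing the affinity with dens.
    def neg_affinity(theta):
        mu, sigma = theta
        if sigma <= 0:
            return np.inf
        f = stats.norm.pdf(grid, loc=mu, scale=sigma)
        return -np.sum(np.sqrt(f * dens)) * dy
    return optimize.minimize(neg_affinity, x0=theta0 + 0.1, method="Nelder-Mead").x

for z in [2.0, 5.0, 10.0, 20.0]:
    print(z, (T(contaminated(z)) - theta0) / alpha)   # bounded; small for large z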
Ref. [1] showed that the MHDE of location has a breakdown point equal to 50 % . Roughly speaking, the breakdown point is the smallest fraction of data that, when strategically placed, can cause an estimator to take arbitrary values. Ref. [24] obtained breakdown results for MHDE of multivariate location and covariance. They showed that the affine-invariant MHDE for multivariate location and covariance has a breakdown point of at least 25 % . Ref. [18] showed that the MHDE has 50 % breakdown in some discrete models.

3. Hellinger Distance Methodology for Compressed Data

In this section we describe the Hellinger distance-based methodology as applied to the compressed data. Since we are seeking to model streaming independent and identically distributed data, we denote by J the number of observations in a fixed time-interval (for instance, every ten minutes, every half-hour, or every three hours). Let B denote the total number of time intervals. Alternatively, B could also represent the number of sources from which the data are collected. Then, the incoming data can be expressed as { X_jl, 1 ≤ j ≤ J; 1 ≤ l ≤ B }. Throughout this paper, we assume that the density of X_jl belongs to a location-scale family and is given by f(x; θ*) = (1/σ*) f( (x − μ*)/σ* ), where θ* = (μ*, σ*). A typical example is a data store receiving data from multiple sources, for instance financial or healthcare organizations, where information from multiple sources across several hours is used to monitor events of interest such as cumulative usage of certain financial instruments or drugs.

3.1. Random Projections

Let R_l = (r_ijl) be an S × J matrix, where S is the number of compressed observations in each time interval, S ≪ J, and the r_ijl's are independent and identically distributed random variables, assumed to be independent of { X_jl, j = 1, 2, …, J; 1 ≤ l ≤ B }. Let
Ỹ_il = Σ_{j=1}^{J} r_ijl X_jl
and set Ỹ_l = (Ỹ_1l, …, Ỹ_Sl); in matrix form this can be expressed as Ỹ_l = R_l X_l. The matrix R_l is referred to as the sensing matrix and { Ỹ_il, i = 1, 2, …, S; l = 1, 2, …, B } is referred to as the compressed data. The total number of compressed observations m = SB is much smaller than the number of original observations n = JB. We notice here that the R_l's are independent and identically distributed random matrices of order S × J. Referring to each time interval or source as a group, the following Table 1 is a tabular representation of the compressed data.
In the random projections literature, the distribution of r_ijl is typically taken to be Gaussian, but other distributions such as the Rademacher distribution, the exponential distribution, and extreme value distributions are also used (for instance, see [25]). In this paper, we do not make any strong distributional assumptions on r_ijl. We only assume that E[r_ijl] = 1 and Var(r_ijl) = γ_0², where E[·] denotes the expectation of the random variable and Var(·) denotes its variance. Additionally, we denote the density of r_ijl by q(·).
We next return to the storage issue. When S = 1 and r_ijl ≡ 1, Ỹ_il is the sum of J random variables. In this case, one retains (stores) only the sum of the J observations, and robust estimates of θ* are sought using this sum. In other situations, that is, when the r_ijl are not degenerate at 1, the distribution of Ỹ_il is complicated. Indeed, even if the r_ijl are assumed to be normally distributed, the marginal distribution of Ỹ_il is complicated. The conditional distribution of Ỹ_il (given the r_ijl) is a weighted sum of location-scale distributions and does not have a useful closed-form expression. Hence, in general, the MLE method is not feasible for these problems. We denote ω_il² = Σ_{j=1}^{J} r_ijl² and work with the random variables Y_il ≡ ω_il^{-1} Ỹ_il. We denote the true density of Y_il by h_J(·|θ*, γ_0). Also, when γ_0 = 0 (which implies r_ijl ≡ 1) we denote the true density of Y_il by h*_J(·|θ*) to emphasize that the true density is a convolution of J independent and identically distributed random variables.
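The following short sketch illustrates the compression step of this subsection; the Gaussian choice for the sensing variables (with mean 1 and variance γ_0², as assumed above) and all sizes are illustrative assumptions only.

import numpy as np

rng = np.random.default_rng(2)
J, S, B, gamma0 = 1000, 5, 200, 0.1
mu, sigma = 1.0, 2.0

X = rng.normal(mu, sigma, size=(B, J))            # raw data, one row per group
R = rng.normal(1.0, gamma0, size=(B, S, J))       # sensing matrices R_l
Y_tilde = np.einsum("bsj,bj->bs", R, X)           # Y~_il = sum_j r_ijl X_jl
omega = np.sqrt((R ** 2).sum(axis=2))             # omega_il = (sum_j r_ijl^2)^{1/2}
Y = Y_tilde / omega                               # Y_il = omega_il^{-1} Y~_il
print(Y.shape)                                    # (B, S): only these are stored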

3.2. Hellinger Distance Method for Compressed Data

In this section, we describe the Hellinger distance-based method for estimating the parameters of the location-scale family using the compressed data. As described in the last section, let { X_jl, j = 1, 2, …, J; l = 1, 2, …, B } be a doubly indexed collection of independent and identically distributed random variables with true density (1/σ*) f( (· − μ*)/σ* ). Our goal is to estimate θ* = (μ*, σ*²) using the compressed data { Y_il, i = 1, 2, …, S; l = 1, 2, …, B }. We re-emphasize here that the density of Y_il depends additionally on γ_0, the variance of the sensing random variables r_ijl.
To formulate the Hellinger-distance estimation method, let G be a class of densities metrized by the L_1 distance. Let { h_J(·|θ, γ_0); θ ∈ Θ } be a parametric family of densities. The Hellinger distance functional T is a measurable mapping from G to Θ, defined as follows:
T(g) ≡ argmin_{θ} ∫_ℝ [ g^{1/2}(y) − h_J^{1/2}(y|θ, γ_0) ]² dy = argmin_{θ} HD²( g, h_J(·|θ, γ_0) ) = θ_g*(γ_0).
When g ( · ) = h J ( · | θ * , γ 0 ) , then under additional assumptions θ g * ( γ 0 ) = θ * ( γ 0 ) . Since minimizing the Hellinger-distance is equivalent to maximizing the affinity, it follows that
T(g) = argmax_{θ} A( g, h_J(·|θ, γ_0) ), where
A( g, h_J(·|θ, γ_0) ) ≡ ∫_ℝ g^{1/2}(y) h_J^{1/2}(y|θ, γ_0) dy.
It is worth noticing here that
A( g, h_J(·|θ, γ_0) ) = 1 − (1/2) HD²( g, h_J(·|θ, γ_0) ).
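This identity follows by expanding the square in the Hellinger distance and using the fact that both g(·) and h_J(·|θ, γ_0) integrate to one:
HD²( g, h_J(·|θ, γ_0) ) = ∫_ℝ [ g^{1/2}(y) − h_J^{1/2}(y|θ, γ_0) ]² dy = ∫_ℝ g(y) dy + ∫_ℝ h_J(y|θ, γ_0) dy − 2 A( g, h_J(·|θ, γ_0) ) = 2 − 2 A( g, h_J(·|θ, γ_0) ).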
To obtain the Hellinger distance estimator of the true unknown parameter θ*, we choose the parametric family h_J(·|θ, γ_0) to be the density of Y_il and g(·) to be a nonparametric L_1-consistent estimator g_B(·) of h_J(·|θ*, γ_0). Thus, the MHDE of θ* is given by
θ̂_B(γ_0) = argmax_{θ} A( g_B, h_J(·|θ, γ_0) ) = T(g_B).
In the notation above, we emphasize the dependence of the estimator on the variance of the projecting random variables. We notice here that the solution to (1) may not be unique. In such cases, we choose one of the solutions in a measurable manner.
The choice of density estimate typically employed in the literature is the kernel density estimate. However, in the setting of compressed data investigated here, there are S observations per group. These S observations are, conditioned on the r_ijl, independent; however, they are marginally dependent (if S > 1). In the case when S > 1, we propose the following construction of g_B(·). First, we consider the estimator
g_B^{(i)}(y) = (1/(B c_B)) Σ_{l=1}^{B} K( (y − Y_il)/c_B ),  i = 1, 2, …, S.
With this choice, the MHDE of θ* is given by, for 1 ≤ i ≤ S,
θ̂_{i,B}(γ_0) = argmax_{θ} A( g_B^{(i)}, h_J(·|θ, γ_0) ).
The above estimate of the density chooses the i-th observation from each group and obtains the kernel density estimator using the resulting B independent and identically distributed compressed observations. This is one choice for the estimator; alternatively, one could obtain S^B different estimators by choosing different combinations of observations from the groups.
It is well-known that this estimator is almost surely L_1-consistent for h_J(·|θ*, γ_0) as long as c_B → 0 and B c_B → ∞ as B → ∞. Hence, under additional regularity and identifiability conditions and further conditions on the bandwidth c_B, existence, uniqueness, consistency and asymptotic normality of θ̂_{i,B}(γ_0), for fixed γ_0, follow from existing results in the literature.
When γ_0 = 0 and r_ijl ≡ 1, as explained previously, the true density is a J-fold convolution of f(·|θ*). It is therefore natural to ask the following question: if one lets γ_0 → 0, will the asymptotic results converge to those obtained by taking γ_0 = 0? We refer to this property as a continuity property in γ_0 of the procedure. Furthermore, it is natural to wonder whether these asymptotic properties can be established uniformly in γ_0. If that is the case, then one can also allow γ_0 to depend on B. This idea has an intuitive appeal since one can choose the parameters of the sensing random variables to achieve an optimal inferential scheme. We address some of these issues in the next subsection.
Finally, we emphasize here that while we do not require S > 1, in applications involving streaming data and privacy problems S tends to be greater than one. In problems where the variance of the sensing variables is large, one can obtain an overall estimator by averaging θ̂_{i,B}(γ_0) over the choices 1 ≤ i ≤ S; that is,
θ̂_B(γ_0) = (1/S) Σ_{i=1}^{S} θ̂_{i,B}(γ_0).
The averaging improves the accuracy of the estimator in small compressed samples (data not presented). For this reason, we provide results for this general case, even though our simulation and theoretical results demonstrate that for some problems considered in this paper, S can be taken to be one. We now turn to our main results, which are presented in the next subsection. Figure 1 provides an overview of our work.
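The estimation scheme of this section can be summarized in the following end-to-end sketch. It is a minimal illustration under assumed Gaussian data, Gaussian sensing variables, and a Monte-Carlo/kernel approximation of h_J(·|θ, γ_0); none of these choices (nor the optimizer, grid, or sample sizes) are prescribed by the paper.

import numpy as np
from scipy import stats, optimize

rng = np.random.default_rng(3)
J, S, B, gamma0 = 200, 3, 300, 0.1
mu_true, sigma_true = 1.0, 2.0

# Compression (Section 3.1): only Y needs to be stored.
X = rng.normal(mu_true, sigma_true, size=(B, J))
R = rng.normal(1.0, gamma0, size=(B, S, J))
Y = np.einsum("bsj,bj->bs", R, X) / np.sqrt((R ** 2).sum(axis=2))

grid = np.linspace(Y.min() - 30.0, Y.max() + 30.0, 2000)
dy = grid[1] - grid[0]

# Fixed simulation draws, reused for every theta, so the objective is smooth.
Z_sim = rng.normal(size=(4000, J))
R_sim = rng.normal(1.0, gamma0, size=(4000, J))
w_sim = np.sqrt((R_sim ** 2).sum(axis=1))

def h_J(theta):
    # Monte-Carlo / kernel approximation of the density of Y_il under theta.
    mu, sigma = theta
    Ys = (R_sim * (mu + sigma * Z_sim)).sum(axis=1) / w_sim
    return stats.gaussian_kde(Ys)(grid)

def mhde_for_index(i):
    # Kernel estimate g_B^{(i)} from the i-th compressed observation of each group,
    # followed by maximization of the affinity over theta.
    g_i = stats.gaussian_kde(Y[:, i])(grid)
    def neg_affinity(theta):
        if theta[1] <= 0:
            return np.inf
        return -np.sum(np.sqrt(h_J(theta) * g_i)) * dy
    start = [Y[:, i].mean() / np.sqrt(J), Y[:, i].std()]
    return optimize.minimize(neg_affinity, x0=start, method="Nelder-Mead").x

theta_hat = np.mean([mhde_for_index(i) for i in range(S)], axis=0)
print(theta_hat)      # averaged estimator; close to (mu_true, sigma_true)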

3.3. Main Results

In this section we state our main results concerning the asymptotic properties of the MHDE of compressed data Y i l . We emphasize here that we only store { ( Y ˜ i l , r i · l , ω i l 2 ) : i = 1 , 2 , , S ; l = 1 , 2 , , B } . Specifically, we establish the continuity property in γ 0 of the proposed methods by establishing the existence of the iterated limits. This provides a first step in establishing the double limit. The first proposition is well-known and is concerned with the existence and uniqueness of MHDE for the location-scale family defined in () using compressed data.
Proposition 1.
Assume that h_J(·|θ, γ_0) is a continuous density function. Assume further that, for every γ_0 ≥ 0, if θ_1 ≠ θ_2 then h_J(y|θ_1, γ_0) ≠ h_J(y|θ_2, γ_0) on a set of positive Lebesgue measure. Then the MHDE in (4) exists and is unique.
Proof. 
The proof follows from Theorem 2.2 of [20] since, without loss of generality, Θ is taken to be compact and the density function h J ( · | θ , γ 0 ) is continuous in θ . □
Consistency: We next turn our attention to consistency. As explained previously, under regularity conditions for each fixed γ 0 , the MHDE θ ^ i , B ( γ 0 ) is consistent for θ * ( γ 0 ) . The next result says that under additional conditions, the consistency property of MHDE is continuous in γ 0 .
Proposition 2.
Let h J ( · | θ , γ 0 ) be a continuous probability density function satisfying the conditions of Proposition 1. Assume that
lim_{γ_0 → 0} sup_{θ ∈ Θ} ∫_ℝ | h_J(y|θ, γ_0) − h*_J(y|θ) | dy = 0.
Then, with probability one (wp1), the iterated limits also exist and equal θ*; that is, for 1 ≤ i ≤ S,
lim_{B → ∞} lim_{γ_0 → 0} θ̂_{i,B}(γ_0) = lim_{γ_0 → 0} lim_{B → ∞} θ̂_{i,B}(γ_0) = θ*.
Proof. 
Without loss of generality let Θ be compact, since otherwise it can be embedded into a compact set as described in [1]. Since f(·) is continuous in θ and g(·) is continuous in γ_0, it follows that h_J(·|θ, γ_0) is continuous in θ and γ_0. Hence, by Theorem 1 of [1], for every fixed γ_0 ≥ 0 and 1 ≤ i ≤ S,
lim_{B → ∞} θ̂_{i,B}(γ_0) = θ*(γ_0).
Thus, to verify the convergence of θ*(γ_0) to θ* as γ_0 → 0, we first establish, using (6), that
lim_{γ_0 → 0} sup_{θ ∈ Θ} | A( h_J(·|θ, γ_0), h*_J(·|θ) ) − 1 | = 0.
To this end, we first notice that
sup_{θ ∈ Θ} HD²( h_J(·|θ, γ_0), h*_J(·|θ) ) ≤ sup_{θ ∈ Θ} ∫_ℝ | h_J(y|θ, γ_0) − h*_J(y|θ) | dy.
Hence, using (3),
sup_{θ ∈ Θ} | A( h_J(·|θ, γ_0), h*_J(·|θ) ) − 1 | = (1/2) sup_{θ ∈ Θ} HD²( h_J(·|θ, γ_0), h*_J(·|θ) ) → 0 as γ_0 → 0.
Hence,
lim_{γ_0 → 0} A( h_J(·|θ*(γ_0), γ_0), h*_J(·|θ*(γ_0)) ) = 1.
Also, by continuity,
lim_{γ_0 → 0} A( h*_J(·|θ*(γ_0)), h*_J(·|θ*) ) = 1,
which, in turn, implies that
lim_{γ_0 → 0} A( h_J(·|θ*(γ_0), γ_0), h*_J(·|θ*) ) = 1.
Thus the existence of the iterated limit, first as B → ∞ and then γ_0 → 0, follows using the compactness of Θ and the identifiability of the model. As for the other iterated limit, again notice that for each 1 ≤ i ≤ S, A( g_B^{(i)}, h_J(·|θ, γ_0) ) converges to A( g_B^{(i)}, h*_J(·|θ) ) with probability one as γ_0 converges to 0. The result then follows again by an application of Theorem 1 of [20]. □
Remark 1.
Verification of condition (6) seems to be involved even in the case of standard Gaussian random variables and standard Gaussian sensing random variables. Indeed, in this case, the density h_J(·|θ, γ_0) is a J-fold convolution of a Bessel function of the second kind. It may be possible to verify condition (6) using the properties of these functions and the compactness of the parameter space Θ. However, if one is focused only on weak consistency, it is an immediate consequence of Theorems 1 and 2 below, and condition (6) is not required. Finally, it is worth mentioning here that the convergence in (6), without uniformity over Θ, is a consequence of the convergence in probability of r_ijl to 1 and Glick's theorem.
Asymptotic limit distribution: We now proceed to investigate the limit distribution of θ̂_{i,B}(γ_0) as B → ∞ and γ_0 → 0. It is well-known that, for fixed γ_0 ≥ 0, after centering and scaling, θ̂_{i,B}(γ_0) has a limiting Gaussian distribution under appropriate regularity conditions (see for instance [20]). However, to evaluate the iterated limits as γ_0 → 0 and B → ∞, additional refinements of the techniques in [20] are required. To this end, we start with additional notation. Let s_J(·|θ, γ_0) = h_J^{1/2}(·|θ, γ_0) and let the score function be denoted by u_J(·|θ, γ_0) ≡ ∇ log h_J(·|θ, γ_0) = ( ∂ log h_J(·|θ, γ_0)/∂μ, ∂ log h_J(·|θ, γ_0)/∂σ ). Also, the Fisher information I(θ(γ_0)) is given by
I(θ(γ_0)) = ∫_ℝ u_J(y|θ, γ_0) u_J^⊤(y|θ, γ_0) h_J(y|θ, γ_0) dy.
In addition, let ṡ_J(·|θ, γ_0) be the gradient of s_J(·|θ, γ_0) with respect to θ, and s̈_J(·|θ, γ_0) the matrix of second derivatives of s_J(·|θ, γ_0) with respect to θ. In addition, let t_J(·|θ) = h*_J^{1/2}(·|θ) and v_J(·|θ) = ∇ log h*_J(·|θ). Furthermore, let Y_il* denote Y_il when γ_0 = 0. Please note that in this case Y_il* = Y_1l* for all i = 1, 2, …, S. The corresponding kernel density estimate based on the Y_il* is given by
g_B*(y) = (1/(B c_B)) Σ_{l=1}^{B} K( (y − Y_il*)/c_B ).
We emphasize here that we suppress i on the LHS of the above equation since g B ( i ) * ( · ) are equal for all 1 i S .
The iterated limit distribution involves additional regularity conditions which are stated in the Appendix. The first step towards this aim is a representation formula which expresses the quantity of interest, viz., B^{1/2}(θ̂_{i,B}(γ_0) − θ*(γ_0)), as a sum of two terms, one involving sums of compressed i.i.d. random variables and the other involving remainder terms that converge to 0 at a specific rate. This expression will appear in different guises in the rest of the manuscript and will play a critical role in the proofs.

3.4. Representation Formula

Before we state the lemma, we first provide two crucial assumptions that allow differentiating the objective function and interchanging the differentiation and integration:
Model assumptions on h J ( · | θ , γ 0 )
(D1) h J ( · | θ , γ 0 ) is twice continuously differentiable in θ .
(D2) Assume further that | | s J ( · | θ , γ 0 ) | | 2 is continuous and bounded.
Lemma 1.
Assume that the conditions (D1) and (D2) hold. Then for every 1 ≤ i ≤ S and γ_0 ≥ 0, the following holds:
B^{1/2}( θ̂_{i,B}(γ_0) − θ*(γ_0) ) = A_{1B}(γ_0) + A_{2B}(γ_0), where
A_{1B}(γ_0) = B^{1/2} D_B^{-1}(θ̃_{i,B}(γ_0)) T_B(γ_0),  A_{2B}(γ_0) = B^{1/2} D_B^{-1}(θ̃_{i,B}(γ_0)) R_B(γ_0),
θ̃_{i,B}(γ_0) ∈ U_B(θ(γ_0)),  U_B(θ(γ_0)) = { θ : θ(γ_0) = t θ*(γ_0) + (1 − t) θ̂_{i,B}(γ_0), t ∈ [0, 1] },
D_B(θ(γ_0)) = (1/2) ∫_ℝ u̇_J(y|θ, γ_0) s_J(y|θ, γ_0) g_B^{(i)1/2}(y) dy − (1/4) ∫_ℝ u_J(y|θ, γ_0) u_J^⊤(y|θ, γ_0) s_J(y|θ, γ_0) g_B^{(i)1/2}(y) dy ≡ D_{1B}(θ(γ_0)) + D_{2B}(θ(γ_0)),
T_B(γ_0) ≡ (1/4) ∫_ℝ u_J(y|θ*, γ_0) [ h_J(y|θ*, γ_0) − g_B^{(i)}(y) ] dy, and
R_B(γ_0) ≡ (1/4) ∫_ℝ u_J(y|θ*, γ_0) [ h_J^{1/2}(y|θ*, γ_0) − g_B^{(i)1/2}(y) ]² dy.
Proof. 
By algebra, note that ṡ_J(y|θ, γ_0) = (1/2) u_J(y|θ, γ_0) s_J(y|θ, γ_0). Furthermore, the second derivative of s_J(·|θ, γ_0) is given by s̈_J(y|θ, γ_0) = (1/2) u̇_J(y|θ, γ_0) s_J(y|θ, γ_0) + (1/4) u_J(y|θ, γ_0) u_J^⊤(y|θ, γ_0) s_J(y|θ, γ_0). Now using (D1) and (D2), partially differentiating HD_B²(θ(γ_0)) ≡ HD²( g_B^{(i)}(·), h_J(·|θ, γ_0) ) with respect to θ and setting the derivative equal to 0, the estimating equation for θ*(γ_0) is
∇HD_B²( θ*(γ_0) ) = 0.
Let θ̂_{i,B}(γ_0) be the solution to (14). Now applying a first-order Taylor expansion to (14) we get
∇HD_B²( θ*(γ_0) ) = ∇HD_B²( θ̂_{i,B}(γ_0) ) + D_B(θ̃_{i,B}(γ_0)) ( θ̂_{i,B}(γ_0) − θ*(γ_0) ),
where θ̃_{i,B}(γ_0) is defined in (10), and D_B(·) is given by
D_B(θ(γ_0)) = (1/2) ∫_ℝ u̇_J(y|θ, γ_0) s_J(y|θ, γ_0) g_B^{(i)1/2}(y) dy − (1/4) ∫_ℝ u_J(y|θ, γ_0) u_J^⊤(y|θ, γ_0) s_J(y|θ, γ_0) g_B^{(i)1/2}(y) dy ≡ D_{1B}(θ(γ_0)) + D_{2B}(θ(γ_0)),
and ∇HD_B²(·) is given by
∇HD_B²(θ(γ_0)) = (1/2) ∫_ℝ u_J(y|θ, γ_0) s_J(y|θ, γ_0) [ h_J^{1/2}(y|θ*, γ_0) − g_B^{(i)1/2}(y) ] dy.
Thus,
θ̂_{i,B}(γ_0) − θ*(γ_0) = D_B^{-1}(θ̃_{i,B}(γ_0)) ∇HD_B²( θ*(γ_0) ).
By using the identity b^{1/2} − a^{1/2} = (2 a^{1/2})^{-1} [ (b − a) − (b^{1/2} − a^{1/2})² ], ∇HD_B²( θ*(γ_0) ) can be expressed as the difference of T_B(γ_0) and R_B(γ_0), where
T_B(γ_0) ≡ (1/4) ∫_ℝ u_J(y|θ*, γ_0) [ h_J(y|θ*, γ_0) − g_B^{(i)}(y) ] dy,
and
R_B(γ_0) ≡ (1/4) ∫_ℝ u_J(y|θ*, γ_0) [ h_J^{1/2}(y|θ*, γ_0) − g_B^{(i)1/2}(y) ]² dy.
Hence,
B^{1/2}( θ̂_{i,B}(γ_0) − θ*(γ_0) ) = A_{1B}(γ_0) + A_{2B}(γ_0),
where A 1 B ( γ 0 ) and A 2 B ( γ 0 ) are given in (9). □
Remark 2.
In the rest of the manuscript, we will refer to A 2 B ( γ 0 ) as the remainder term in the representation formula.
We now turn to the first main result of the manuscript, namely a central limit theorem for θ̂_{i,B}(γ_0) as first B → ∞ and then γ_0 → 0. As a first step, we note that the Fisher information of the density h*_J(·|θ) is given by
I(θ) = ∫_ℝ v_J(y|θ) v_J^⊤(y|θ) h*_J(y|θ) dy.
Next we state the assumptions needed in the proof of Theorem 1. We separate these conditions into (i) model assumptions, (ii) kernel assumptions, (iii) regularity conditions, and (iv) conditions that allow comparison of the original data and the compressed data.
Model assumptions on h * J ( · | θ )
(D1’) h * J ( · | θ ) is twice continuously differentiable in θ .
(D2’) Assume further that | | t J ( · | θ ) | | 2 is continuous and bounded.
Kernel assumptions
(B1) K(·) is symmetric about 0, has compact support, and is bounded in L_2. We denote the support of K(·) by Supp(K).
(B2) The bandwidth c_B satisfies c_B → 0, B^{1/2} c_B → ∞, and B^{1/2} c_B² → 0.
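For instance (an illustrative choice, not one prescribed in the paper), the bandwidth c_B = B^{-1/3} satisfies (B2), since
c_B = B^{-1/3} → 0,  B^{1/2} c_B = B^{1/6} → ∞,  and  B^{1/2} c_B² = B^{-1/6} → 0  as B → ∞.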
Regularity conditions
(M1) The function u J ( · | θ , γ 0 ) s J ( · | θ , γ 0 ) is continuously differentiable and bounded in L 2 at θ * .
(M2) The function u̇_J(·|θ, γ_0) s_J(·|θ, γ_0) is continuous and bounded in L_2 at θ*. In addition, assume that
lim_{B → ∞} ∫_ℝ ‖ u̇_J(y|θ_{i,B}, γ_0) s_J(y|θ_{i,B}, γ_0) − u̇_J(y|θ*, γ_0) s_J(y|θ*, γ_0) ‖² dy = 0.
(M3) The function u_J(·|θ, γ_0) u_J^⊤(·|θ, γ_0) s_J(·|θ, γ_0) is continuous and bounded in L_2 at θ*; also,
lim_{B → ∞} ∫_ℝ ‖ u_J(y|θ̂_{i,B}, γ_0) u_J^⊤(y|θ̂_{i,B}, γ_0) s_J(y|θ̂_{i,B}, γ_0) − u_J(y|θ*, γ_0) u_J^⊤(y|θ*, γ_0) s_J(y|θ*, γ_0) ‖² dy = 0.
(M4) Let { α_B : B ≥ 1 } be a sequence diverging to infinity. Assume that
lim_{B → ∞} B sup_{t ∈ Supp(K)} P_{θ*(γ_0)}( |Δ − c_B t| > α_B ) = 0,
where Supp(K) is the support of the kernel K(·) and Δ is a generic random variable with density h_J(·|θ*, γ_0).
(M5) Let
M_B = sup_{|y| ≤ α_B} sup_{t ∈ Supp(K)} h_J(y − t c_B|θ*, γ_0) / h_J(y|θ*, γ_0).
Assume sup_{B ≥ 1} M_B < ∞.
(M6) The score function has a regular central behavior relative to the smoothing constants, i.e.,
lim_{B → ∞} (B^{1/2} c_B)^{-1} ∫_{−α_B}^{α_B} | u_J(y|θ*, γ_0) | dy = 0.
Furthermore,
lim_{B → ∞} (B^{1/2} c_B⁴) ∫_{−α_B}^{α_B} | u_J(y|θ*, γ_0) | dy = 0.
(M7) The density functions are smooth in an L_2 sense, i.e.,
lim_{B → ∞} sup_{t ∈ Supp(K)} ∫_ℝ ‖ u_J(y + c_B t|θ*, γ_0) − u_J(y|θ*, γ_0) ‖² h_J(y|θ*, γ_0) dy = 0.
(M1’) The function v J ( · | θ ) t J ( · | θ ) is continuously differentiable and bounded in L 2 at θ * .
(M2’) The function v̇_J(·|θ) t_J(·|θ) is continuous and bounded in L_2 at θ*. In addition, assume that
lim_{B → ∞} ∫_ℝ ‖ v̇_J(y|θ_B) t_J(y|θ_B) − v̇_J(y|θ*) t_J(y|θ*) ‖² dy = 0.
(M3’) The function v_J(·|θ) v_J^⊤(·|θ) t_J(·|θ) is continuous and bounded in L_2 at θ*; also,
lim_{B → ∞} ∫_ℝ ‖ v_J(y|θ̂_{i,B}) v_J^⊤(y|θ̂_{i,B}) t_J(y|θ̂_{i,B}) − v_J(y|θ*) v_J^⊤(y|θ*) t_J(y|θ*) ‖² dy = 0.
Assumptions comparing models for original and compressed data
(O1) For all θ ∈ Θ,
lim_{γ_0 → 0} ∫_ℝ ‖ u_J(y|θ, γ_0) u_J^⊤(y|θ, γ_0) s_J(y|θ, γ_0) − v_J(y|θ) v_J^⊤(y|θ) t_J(y|θ) ‖² dy = 0.
(O2) For all θ ∈ Θ,
lim_{γ_0 → 0} ∫_ℝ ‖ u̇_J(y|θ, γ_0) s_J(y|θ, γ_0) − v̇_J(y|θ) t_J(y|θ) ‖² dy = 0.
Theorem 1.
Assume that the conditions (B1)–(B2), (D1)–(D2) , (D1’)–(D2’), (M1)–(M7), (M1’)–(M3’), and (O1)–(O2) hold. Then, for every 1 i S , the following holds:
lim_{γ_0 → 0} lim_{B → ∞} P( B^{1/2}( θ̂_{i,B}(γ_0) − θ*(γ_0) ) ≤ x ) = P( G ≤ x ),
where G is a bivariate Gaussian random variable with mean 0 and variance I^{-1}(θ*), where I(θ) is defined in (15).
Before we embark on the proof of Theorem 1, we first discuss the assumptions. Assumptions (B1) and (B2) are standard assumptions on the kernel and the bandwidth and are typically employed when investigating the asymptotic behavior of divergence-based estimators (see for instance [1]). Assumptions (M1)–(M7) and (M1’)–(M3’) are regularity conditions which are concerned essentially with L_2 continuity and boundedness of the scores and their derivatives. Assumptions (O1)–(O2) allow for comparison of u_J(·|θ, γ_0) and v_J(·|θ). Returning to the proof of Theorem 1, using the representation formula, we will first show that lim_{γ_0 → 0} lim_{B → ∞} P( A_{1B}(γ_0) ≤ x ) = P( G ≤ x ), and then prove that lim_{γ_0 → 0} lim_{B → ∞} A_{2B}(γ_0) = 0 in probability. We start with the following proposition.
Proposition 3.
Assume that the conditions (B1), (D1)–(D2), (M1)–(M3), (M1’)–(M3’), (M7) and (O1)–(O2) hold. Then,
lim_{γ_0 → 0} lim_{B → ∞} P( A_{1B}(γ_0) ≤ x ) = P( G ≤ x ),
where G is given in Theorem 1.
We divide the proof of Proposition 3 into two lemmas. In the first lemma we will show that
lim_{γ_0 → 0} lim_{B → ∞} D_B(θ̃_{i,B}(γ_0)) = (1/4) I(θ*).
Next, in the second lemma, we will show that, first letting B → ∞ and then allowing γ_0 → 0,
4 B^{1/2} T_B(γ_0) →_d N( 0, I(θ*) ).
We start with the first part.
Lemma 2.
Assume that the conditions (D1)–(D2), (D1’)–(D2’), (M1)–(M3), (M1’)–(M3’) and (O1)–(O2) hold. Then, with probability one, the following prevails:
lim_{γ_0 → 0} lim_{B → ∞} D_B(θ̃_{i,B}(γ_0)) = (1/4) I(θ*).
Proof. 
We use the representation formula in Lemma 1. First fix γ_0 > 0. It suffices to show
lim_{B → ∞} D_{1B}(θ̃_{i,B}(γ_0)) = (1/2) I(θ*(γ_0)), and lim_{B → ∞} D_{2B}(θ̃_{i,B}(γ_0)) = (1/4) I(θ*(γ_0)).
We begin with D_{1B}(θ̃_{i,B}(γ_0)). By algebra, D_{1B}(θ̃_{i,B}(γ_0)) can be expressed as
D_{1B}(θ̃_{i,B}(γ_0)) = D_{1B}^{(1)}(θ̃_{i,B}(γ_0)) + D_{1B}^{(2)}(θ̃_{i,B}(γ_0)) + D_{1B}^{(3)}(θ*(γ_0)), where
D_{1B}^{(1)}(θ̃_{i,B}(γ_0)) = (1/2) ∫_ℝ u̇_J(y|θ̃_{i,B}, γ_0) s_J(y|θ̃_{i,B}, γ_0) [ g_B^{(i)1/2}(y) − s_J(y|θ*, γ_0) ] dy,
D_{1B}^{(2)}(θ̃_{i,B}(γ_0)) = (1/2) ∫_ℝ [ u̇_J(y|θ̃_{i,B}, γ_0) s_J(y|θ̃_{i,B}, γ_0) − u̇_J(y|θ*, γ_0) s_J(y|θ*, γ_0) ] h_J^{1/2}(y|θ*, γ_0) dy,
and D_{1B}^{(3)}(θ*(γ_0)) = (1/2) ∫_ℝ u̇_J(y|θ*, γ_0) h_J(y|θ*, γ_0) dy = (1/2) I(θ*(γ_0)).
It suffices to show that, as B → ∞, D_{1B}^{(1)}(θ̃_{i,B}(γ_0)) → 0 and D_{1B}^{(2)}(θ̃_{i,B}(γ_0)) → 0. We first consider D_{1B}^{(1)}(θ̃_{i,B}(γ_0)). By the Cauchy-Schwarz inequality and assumption (M2), it follows that there exists 0 < C_1 < ∞ such that
| D_{1B}^{(1)}(θ̃_{i,B}(γ_0)) | ≤ (1/2) [ ∫_ℝ ‖ u̇_J(y|θ̃_{i,B}, γ_0) s_J(y|θ̃_{i,B}, γ_0) ‖² dy ]^{1/2} [ ∫_ℝ ( g_B^{(i)1/2}(y) − s_J(y|θ*, γ_0) )² dy ]^{1/2} ≤ C_1 [ ∫_ℝ ( g_B^{(i)1/2}(y) − s_J(y|θ*, γ_0) )² dy ]^{1/2} → 0,
where the last convergence follows from the L_1 convergence of g_B^{(i)}(·) to h_J(·|θ*, γ_0). Hence, as B → ∞, D_{1B}^{(1)}(θ̃_{i,B}(γ_0)) → 0. Next we consider D_{1B}^{(2)}(θ̃_{i,B}(γ_0)). Again, by the Cauchy-Schwarz inequality and assumption (M2), it follows that D_{1B}^{(2)}(θ̃_{i,B}(γ_0)) → 0. Hence D_{1B}(θ̃_{i,B}(γ_0)) → (1/2) I(θ*(γ_0)). Turning to D_{2B}(θ̃_{i,B}(γ_0)), by a similar argument, using the Cauchy-Schwarz inequality and assumption (M3), it follows that D_{2B}(θ̃_{i,B}(γ_0)) → (1/4) I(θ*(γ_0)). Thus, to complete the proof, it is enough to show that
lim_{γ_0 → 0} lim_{B → ∞} D_{1B}(θ̃_{i,B}(γ_0)) = (1/2) I(θ*) and lim_{γ_0 → 0} lim_{B → ∞} D_{2B}(θ̃_{i,B}(γ_0)) = (1/4) I(θ*).
We start with the first term of (16). Let
J_1(γ_0) = ∫_ℝ u̇_J(y|θ*, γ_0) h_J(y|θ*, γ_0) dy − ∫_ℝ v̇_J(y|θ*) h*_J(y|θ*) dy.
We will show that lim_{γ_0 → 0} J_1(γ_0) = 0. By algebra, the difference of the above two terms can be expressed as the sum of J_{11}(γ_0) and J_{12}(γ_0), where
J_{11}(γ_0) = ∫_ℝ [ u̇_J(y|θ*, γ_0) s_J(y|θ*, γ_0) − v̇_J(y|θ*) t_J(y|θ*) ] s_J(y|θ*, γ_0) dy, and
J_{12}(γ_0) = ∫_ℝ v̇_J(y|θ*) t_J(y|θ*) [ s_J(y|θ*, γ_0) − t_J(y|θ*) ] dy.
J_{11}(γ_0) converges to zero by the Cauchy-Schwarz inequality and assumption (O2), and J_{12}(γ_0) converges to zero by the Cauchy-Schwarz inequality, assumption (M2’) and Scheffé's theorem. Next we consider the second term of (16). Let
J_2(γ_0) = ∫_ℝ u_J(y|θ*, γ_0) u_J^⊤(y|θ*, γ_0) h_J(y|θ*, γ_0) dy − ∫_ℝ v_J(y|θ*) v_J^⊤(y|θ*) h*_J(y|θ*) dy.
We will show that lim_{γ_0 → 0} J_2(γ_0) = 0. By algebra, the difference of the above two terms can be expressed as the sum of J_{21}(γ_0) and J_{22}(γ_0), where
J_{21}(γ_0) = ∫_ℝ [ u_J(y|θ*, γ_0) u_J^⊤(y|θ*, γ_0) s_J(y|θ*, γ_0) − v_J(y|θ*) v_J^⊤(y|θ*) t_J(y|θ*) ] s_J(y|θ*, γ_0) dy,
and J_{22}(γ_0) = ∫_ℝ v_J(y|θ*) v_J^⊤(y|θ*) t_J(y|θ*) [ s_J(y|θ*, γ_0) − t_J(y|θ*) ] dy.
J_{21}(γ_0) converges to zero by the Cauchy-Schwarz inequality and assumption (O1), and J_{22}(γ_0) converges to zero by the Cauchy-Schwarz inequality, assumption (M3’) and Scheffé's theorem. Therefore the lemma holds. □
Lemma 3.
Assume that the conditions (B1), (D1)–(D2), (D1’)–(D2’), (M1)–(M3), (M3’), (M7) and (O1)–(O2) hold. Then, first letting B → ∞ and then γ_0 → 0,
4 B^{1/2} T_B(γ_0) →_d N( 0, I(θ*) ).
Proof. 
First fix γ_0 > 0. Please note that, using ∫_ℝ u_J(y|θ*, γ_0) h_J(y|θ*, γ_0) dy = 0, we have that
4 B^{1/2} T_B(γ_0) = B^{1/2} ∫_ℝ u_J(y|θ*, γ_0) g_B^{(i)}(y) dy = B^{1/2} ∫_ℝ u_J(y|θ*, γ_0) (1/B) Σ_{l=1}^{B} (1/c_B) K( (y − Y_il)/c_B ) dy = B^{1/2} (1/B) Σ_{l=1}^{B} ∫_ℝ u_J(Y_il + c_B t|θ*, γ_0) K(t) dt.
Therefore,
4 B^{1/2} T_B(γ_0) − B^{1/2} (1/B) Σ_{l=1}^{B} u_J(Y_il|θ*, γ_0) = B^{1/2} (1/B) Σ_{l=1}^{B} ∫_ℝ [ u_J(Y_il + c_B t|θ*, γ_0) − u_J(Y_il|θ*, γ_0) ] K(t) dt.
Since the Y_il's are i.i.d. across l, using the Cauchy-Schwarz inequality and assumption (B1), we can show that there exists 0 < C < ∞ such that
E‖ 4 B^{1/2} T_B(γ_0) − B^{1/2} (1/B) Σ_{l=1}^{B} u_J(Y_il|θ*, γ_0) ‖² = E‖ ∫_ℝ [ u_J(Y_{i1} + c_B t|θ*, γ_0) − u_J(Y_{i1}|θ*, γ_0) ] K(t) dt ‖² ≤ C E ∫_ℝ ‖ u_J(Y_{i1} + c_B t|θ*, γ_0) − u_J(Y_{i1}|θ*, γ_0) ‖² dt = C ∫_ℝ ∫_ℝ ‖ u_J(y + c_B t|θ*, γ_0) − u_J(y|θ*, γ_0) ‖² h_J(y|θ*, γ_0) dy dt,
which converges to zero as B → ∞ by assumption (M7). Also, the limiting distribution of 4 B^{1/2} T_B(γ_0) is N( 0, I(θ*(γ_0)) ) as B → ∞. Now let γ_0 → 0. It is enough to show that, as γ_0 → 0, the density of N( 0, I(θ*(γ_0)) ) converges to the density of N( 0, I(θ*) ). To this end, it suffices to show that lim_{γ_0 → 0} I(θ*(γ_0)) = I(θ*). However, this is established in Lemma 2. Combining the results, the lemma follows. □
Proof of Proposition 3.
The proof of Proposition 3 follows immediately by combining Lemmas 2 and 3. □
We now turn to establishing that the remainder term in the representation formula converges to zero.
Lemma 4.
Assume that the assumptions (B1)–(B2), (M1)–(M6) hold. Then
lim_{γ_0 → 0} lim_{B → ∞} A_{2B}(γ_0) = 0 in probability.
Proof. 
Using Lemma 2, it is sufficient to show that B^{1/2} R_B(γ_0) converges to 0 in probability as B → ∞. Let
d_J(y|θ*(γ_0)) = g_B^{(i)1/2}(y) − s_J(y|θ*, γ_0).
Please note that
d_J²(y|θ*(γ_0)) ≤ 2 [ ( h_J(y|θ*, γ_0) − E g_B^{(i)}(y) )² + ( E g_B^{(i)}(y) − g_B^{(i)}(y) )² ] h_J^{-1}(y|θ*, γ_0).
Then
| R_B(γ_0) | ≤ (1/2) ∫_ℝ | u_J(y|θ*, γ_0) | d_J²(y|θ*(γ_0)) dy = (1/2) ∫_{−α_B}^{α_B} | u_J(y|θ*, γ_0) | d_J²(y|θ*(γ_0)) dy + (1/2) ∫_{|y| ≥ α_B} | u_J(y|θ*, γ_0) | d_J²(y|θ*(γ_0)) dy ≡ R_{1B}(γ_0) + R_{2B}(γ_0).
We first deal with R_{1B}(γ_0), which can be expressed as the sum of R_{1B}^{(1)}(γ_0) and R_{1B}^{(2)}(γ_0), where
R_{1B}^{(1)}(γ_0) = ∫_{−α_B}^{α_B} | u_J(y|θ*, γ_0) | ( h_J(y|θ*, γ_0) − E g_B^{(i)}(y) )² h_J^{-1}(y|θ*, γ_0) dy,
and R_{1B}^{(2)}(γ_0) = ∫_{−α_B}^{α_B} | u_J(y|θ*, γ_0) | ( E g_B^{(i)}(y) − g_B^{(i)}(y) )² h_J^{-1}(y|θ*, γ_0) dy.
Now consider R_{1B}^{(2)}(γ_0). Let ϵ > 0 be arbitrary but fixed. Then, by Markov's inequality,
P( B^{1/2} R_{1B}^{(2)}(γ_0) > ϵ ) ≤ ϵ^{-1} B^{1/2} E[ R_{1B}^{(2)}(γ_0) ] ≤ ϵ^{-1} B^{1/2} ∫_{−α_B}^{α_B} | u_J(y|θ*, γ_0) | Var( g_B^{(i)}(y) ) h_J^{-1}(y|θ*, γ_0) dy.
Now, since the Y_il's are independent and identically distributed across l, it follows that
Var( g_B^{(i)}(y) ) ≤ (1/(B c_B)) ∫_ℝ K²(t) h_J(y − t c_B|θ*, γ_0) dt.
Now plugging (19) into (18) and interchanging the order of integration (using Tonelli's theorem), we get
P( B^{1/2} R_{1B}^{(2)}(γ_0) > ϵ ) ≤ C (B^{1/2} c_B)^{-1} ∫_{−α_B}^{α_B} | u_J(y|θ*, γ_0) | dy → 0,
where C is a universal constant, and the last convergence follows from conditions (M5)–(M6). We now deal with R_{1B}^{(1)}(γ_0). To this end, we need to calculate ( E g_B^{(i)}(y) − h_J(y|θ*, γ_0) )². Using a change of variables, a two-step Taylor approximation, and assumption (B1), we get
E g_B^{(i)}(y) − h_J(y|θ*, γ_0) = ∫_ℝ K(t) [ h_J(y − t c_B|θ*, γ_0) − h_J(y|θ*, γ_0) ] dt = ∫_ℝ K(t) ( (t c_B)²/2 ) h_J''(y_B*(t)|θ*, γ_0) dt.
Now plugging (20) into (17) and using conditions (M3) and (M6), we get
B^{1/2} R_{1B}^{(1)}(γ_0) ≤ C B^{1/2} c_B⁴ ∫_{−α_B}^{α_B} | u_J(y|θ*, γ_0) | dy.
Convergence of (21) to 0 now follows from condition (M6). We next deal with R_{2B}(γ_0). To this end, by writing out the square term of d_J(·|θ*(γ_0)), we have
B^{1/2} R_{2B}(γ_0) ≤ B^{1/2} ∫_{|y| ≥ α_B} | u_J(y|θ*, γ_0) | [ h_J(y|θ*, γ_0) + g_B^{(i)}(y) − s_J(y|θ*, γ_0) g_B^{(i)1/2}(y) ] dy.
We will show that the RHS of (22) converges to 0 as B → ∞. We begin with the first term. Please note that, by the Cauchy-Schwarz inequality,
[ B^{1/2} ∫_{|y| ≥ α_B} | u_J(y|θ*, γ_0) | h_J(y|θ*, γ_0) dy ]² ≤ [ ∫_ℝ u_J^⊤(y|θ*, γ_0) u_J(y|θ*, γ_0) h_J(y|θ*, γ_0) dy ] { B P_{θ*(γ_0)}( |Δ| ≥ α_B ) },
and the last term converges to 0 by (M4). As for the second term, note that, a.s., by the Cauchy-Schwarz inequality,
[ ∫_{|y| ≥ α_B} | u_J(y|θ*, γ_0) | g_B^{(i)}(y) dy ]² ≤ ∫_{|y| ≥ α_B} u_J^⊤(y|θ*, γ_0) u_J(y|θ*, γ_0) g_B^{(i)}(y) dy.
Now taking the expectation and using the Cauchy-Schwarz inequality, one can show that
B E[ ∫_{|y| ≥ α_B} | u_J(y|θ*, γ_0) | g_B^{(i)}(y) dy ]² ≤ a_B ∫_ℝ K(t) [ ∫_ℝ u_J^⊤(y|θ*, γ_0) u_J(y|θ*, γ_0) h_J(y − c_B t|θ*, γ_0) dy ] dt,
where a_B = B sup_{z ∈ Supp(K)} P_{θ*}( |Δ − c_B z| > α_B ). The convergence to 0 of the RHS of the above inequality now follows from condition (M4). Finally, by another application of the Cauchy-Schwarz inequality,
B E[ ∫_{|y| ≥ α_B} | u_J(y|θ*, γ_0) | g_B^{(i)1/2}(y) s_J(y|θ*, γ_0) dy ]² ≤ a_B ∫_ℝ u_J^⊤(y − c_B t|θ*, γ_0) u_J(y − c_B t|θ*, γ_0) h_J(y|θ*, γ_0) dy.
The convergence of the RHS of the above inequality to zero follows from (M4). Now the lemma follows. □
Proof of Theorem 1.
Recall that
B^{1/2}( θ̂_{i,B}(γ_0) − θ*(γ_0) ) = A_{1B}(γ_0) + A_{2B}(γ_0),
where A_{1B}(γ_0) and A_{2B}(γ_0) are given in (9). Proposition 3 shows that, first letting B → ∞ and then γ_0 → 0, A_{1B}(γ_0) converges in distribution to N( 0, I^{-1}(θ*) ); while Lemma 4 shows that lim_{γ_0 → 0} lim_{B → ∞} A_{2B}(γ_0) = 0 in probability. The result follows from Slutsky's theorem. □
We next show that, by interchanging the limits, namely first allowing γ_0 to converge to 0 and then letting B → ∞, the limit distribution of θ̂_{i,B}(γ_0) is Gaussian with the same covariance matrix as in Theorem 1. We begin with the additional assumptions required in the proof of the theorem.
Regularity conditions
(M4’) Let { α_B : B ≥ 1 } be a sequence diverging to infinity. Assume that
lim_{B → ∞} B sup_{t ∈ Supp(K)} P_{θ*}( |Δ − c_B t| > α_B ) = 0,
where Supp(K) is the support of the kernel K(·) and Δ is a generic random variable with density h*_J(·|θ*).
(M5’) Let
M_B = sup_{|y| ≤ α_B} sup_{t ∈ Supp(K)} h*_J(y − t c_B|θ*) / h*_J(y|θ*).
Assume that sup_{B ≥ 1} M_B < ∞.
(M6’) The score function has a regular central behavior relative to the smoothing constants, i.e.,
lim_{B → ∞} (B^{1/2} c_B)^{-1} ∫_{−α_B}^{α_B} | v_J(y|θ*) | dy = 0.
Furthermore,
lim_{B → ∞} (B^{1/2} c_B⁴) ∫_{−α_B}^{α_B} | v_J(y|θ*) | dy = 0.
(M7’) The density functions are smooth in an L_2 sense, i.e.,
lim_{B → ∞} sup_{t ∈ Supp(K)} ∫_ℝ ‖ v_J(y + c_B t|θ*) − v_J(y|θ*) ‖² h*_J(y|θ*) dy = 0.
Assumptions comparing models for original and compressed data
(V1) Assume that lim_{γ_0 → 0} sup_y | u_J(y|θ*, γ_0) − v_J(y|θ*) | = 0.
(V2) v_J(·|θ) is L_1 continuous in the sense that X_n →_p X implies E| v_J(X_n|θ) − v_J(X|θ) | → 0, where the expectation is with respect to the distribution K(·).
(V3) Assume that for all θ ∈ Θ, ∫_ℝ h*_J(y|θ) dy < ∞.
(V4) Assume that for all θ ∈ Θ, lim_{γ_0 → 0} sup_y | s_J(y|θ, γ_0)/t_J(y|θ) − 1 | = 0.
Theorem 2.
Assume that the conditions (B1)–(B2), (D1’)–(D2’), (M1’)–(M7’), (O1)–(O2) and (V1)–(V4) hold. Then,
lim_{B → ∞} lim_{γ_0 → 0} P( B^{1/2}( θ̂_{i,B}(γ_0) − θ*(γ_0) ) ≤ x ) = P( G ≤ x ),
where G is a bivariate Gaussian random variable with mean 0 and variance I^{-1}(θ*).
We notice that in Theorem 2 above we use conditions (V2)–(V4), which are regularity conditions on the scores of the J-fold convolution of f(·), while (V1) facilitates comparison of the scores of the densities of the compressed data with those of the J-fold convolution. As before, we will first establish (a):
lim_{B → ∞} lim_{γ_0 → 0} P( A_{1B}(γ_0) ≤ x ) = P( G ≤ x ),
and then (b): lim_{B → ∞} lim_{γ_0 → 0} A_{2B}(γ_0) = 0 in probability. We start with the proof of (a).
Proposition 4.
Assume that the conditions (B1)–(B2), (D1’)–(D2’), (M1’)–(M3’), (M7’), (O1)–(O2), and (V1)–(V2) hold. Then,
lim_{B → ∞} lim_{γ_0 → 0} P( A_{1B}(γ_0) ≤ x ) = P( G ≤ x ).
We divide the proof of Proposition 4 into two lemmas. In the first lemma, we will show that
lim_{B → ∞} lim_{γ_0 → 0} D_B(θ̃_{i,B}(γ_0)) = (1/4) I(θ*).
In the second lemma, we will show that, first letting γ_0 → 0 and then letting B → ∞,
4 B^{1/2} T_B(γ_0) →_d N( 0, I(θ*) ).
Lemma 5.
Assume that the conditions (B1)–(B2), (D1’)–(D2’), (M1’)–(M3’), (O1)–(O2), and (V1)–(V2) hold. Then,
lim_{B → ∞} lim_{γ_0 → 0} D_B(θ̃_{i,B}(γ_0)) = (1/4) I(θ*).
Proof. 
First fix B. Recall that
D_B(θ(γ_0)) = (1/2) ∫_ℝ u̇_J(y|θ, γ_0) s_J(y|θ, γ_0) g_B^{(i)1/2}(y) dy − (1/4) ∫_ℝ u_J(y|θ, γ_0) u_J^⊤(y|θ, γ_0) s_J(y|θ, γ_0) g_B^{(i)1/2}(y) dy ≡ D_{1B}(θ(γ_0)) + D_{2B}(θ(γ_0)).
By algebra, D_{1B}(θ̃_{i,B}(γ_0)) can be expressed as the sum of H_{1B}^{(1)}, H_{1B}^{(2)}, H_{1B}^{(3)}, H_{1B}^{(4)} and H_{1B}^{(5)}, where
H_{1B}^{(1)} = (1/2) ∫_ℝ [ u̇_J(y|θ̃_{i,B}, γ_0) s_J(y|θ̃_{i,B}, γ_0) − v̇_J(y|θ̃_{i,B}) t_J(y|θ̃_{i,B}) ] g_B^{(i)1/2}(y) dy,
H_{1B}^{(2)} = (1/2) ∫_ℝ [ v̇_J(y|θ̃_{i,B}) t_J(y|θ̃_{i,B}) − v̇_J(y|θ*) t_J(y|θ*) ] g_B^{(i)1/2}(y) dy,
H_{1B}^{(3)} = (1/2) ∫_ℝ v̇_J(y|θ*) t_J(y|θ*) [ g_B^{(i)1/2}(y) − h_J^{1/2}(y|θ*, γ_0) ] dy,
H_{1B}^{(4)} = (1/2) ∫_ℝ v̇_J(y|θ*) t_J(y|θ*) [ s_J(y|θ*, γ_0) − t_J(y|θ*) ] dy, and H_{1B}^{(5)} = (1/2) I(θ*).
We will show that
lim_{γ_0 → 0} D_{1B}(θ̃_{i,B}(γ_0)) = H_{1B}^{(2)} + lim_{γ_0 → 0} H_{1B}^{(3)} + H_{1B}^{(5)},
where
lim_{γ_0 → 0} H_{1B}^{(3)} = (1/2) ∫_ℝ v̇_J(y|θ*) t_J(y|θ*) [ g_B^{*1/2}(y) − t_J(y|θ*) ] dy and
g_B*(·) is given in (7). First consider H_{1B}^{(1)}. It converges to zero as γ_0 → 0 by the Cauchy-Schwarz inequality and assumption (O2). Next we consider H_{1B}^{(3)}. We will first show that
lim_{γ_0 → 0} (1/2) ∫_ℝ v̇_J(y|θ*) t_J(y|θ*) g_B^{(i)1/2}(y) dy = (1/2) ∫_ℝ v̇_J(y|θ*) t_J(y|θ*) g_B^{*1/2}(y) dy.
To this end, notice that by the Cauchy-Schwarz inequality and the boundedness of v̇_J(y|θ*) t_J(y|θ*) in L_2, it follows that there exists a constant C such that
| ∫_ℝ v̇_J(y|θ*) t_J(y|θ*) [ g_B^{(i)1/2}(y) − g_B^{*1/2}(y) ] dy | ≤ C [ ∫_ℝ ( g_B^{(i)1/2}(y) − g_B^{*1/2}(y) )² dy ]^{1/2} ≤ C [ ∫_ℝ | g_B^{(i)}(y) − g_B*(y) | dy ]^{1/2}.
It suffices to show that g_B^{(i)}(·) converges to g_B*(·) in L_1. Since
∫_ℝ | g_B^{(i)}(y) − g_B*(y) | dy = 2 − 2 ∫_ℝ min( g_B^{(i)}(y), g_B*(y) ) dy,
and min( g_B^{(i)}(y), g_B*(y) ) ≤ g_B*(y), by the dominated convergence theorem, g_B^{(i)}(·) → g_B*(·) in L_1. Next we will show that
lim_{γ_0 → 0} (1/2) ∫_ℝ v̇_J(y|θ*) t_J(y|θ*) s_J(y|θ*, γ_0) dy = (1/2) ∫_ℝ v̇_J(y|θ*) t_J(y|θ*) t_J(y|θ*) dy.
In addition, by the Cauchy-Schwarz inequality, the boundedness of v̇_J(y|θ*) t_J(y|θ*) in L_2 and Scheffé's theorem, we have that ∫_ℝ v̇_J(y|θ*) h_J^{1/2}(y|θ*, γ_0) [ s_J(y|θ*, γ_0) − t_J(y|θ*) ] dy converges to zero as γ_0 → 0. Next we consider H_{1B}^{(4)}. It converges to zero by the Cauchy-Schwarz inequality and assumption (M2’). Thus (24) holds. Now letting B → ∞, we will show that lim_{B → ∞} H_{1B}^{(2)} = 0 and lim_{B → ∞} lim_{γ_0 → 0} H_{1B}^{(3)} = 0. First consider lim_{B → ∞} H_{1B}^{(2)}. It converges to zero by the Cauchy-Schwarz inequality and assumption (M2’). Next we consider lim_{B → ∞} lim_{γ_0 → 0} H_{1B}^{(3)}. It converges to zero by the Cauchy-Schwarz inequality and the L_1 convergence of g_B*(·) to h*_J(·|θ*). Therefore lim_{B → ∞} lim_{γ_0 → 0} D_{1B}(θ̃_{i,B}(γ_0)) = (1/2) I(θ*).
We now turn to showing that lim_{B → ∞} lim_{γ_0 → 0} D_{2B}(θ̃_{i,B}(γ_0)) = (1/4) I(θ*). First fix B and express D_{2B}(θ̃_{i,B}(γ_0)) as the sum of H_{2B}^{(1)}, H_{2B}^{(2)}, H_{2B}^{(3)}, H_{2B}^{(4)}, and H_{2B}^{(5)}, where
H_{2B}^{(1)} = (1/4) ∫_ℝ [ u_J(y|θ̃_{i,B}, γ_0) u_J^⊤(y|θ̃_{i,B}, γ_0) s_J(y|θ̃_{i,B}, γ_0) − v_J(y|θ̃_{i,B}) v_J^⊤(y|θ̃_{i,B}) t_J(y|θ̃_{i,B}) ] g_B^{(i)1/2}(y) dy,
H_{2B}^{(2)} = (1/4) ∫_ℝ [ v_J(y|θ̃_{i,B}) v_J^⊤(y|θ̃_{i,B}) t_J(y|θ̃_{i,B}) − v_J(y|θ*) v_J^⊤(y|θ*) t_J(y|θ*) ] g_B^{(i)1/2}(y) dy,
H_{2B}^{(3)} = (1/4) ∫_ℝ v_J(y|θ*) v_J^⊤(y|θ*) t_J(y|θ*) [ g_B^{(i)1/2}(y) − h_J^{1/2}(y|θ*, γ_0) ] dy,
H_{2B}^{(4)} = (1/4) ∫_ℝ v_J(y|θ*) v_J^⊤(y|θ*) t_J(y|θ*) [ s_J(y|θ*, γ_0) − t_J(y|θ*) ] dy, and H_{2B}^{(5)} = (1/4) I(θ*).
We will show that
lim_{γ_0 → 0} D_{2B}(θ̃_{i,B}(γ_0)) = H_{2B}^{(2)} + lim_{γ_0 → 0} H_{2B}^{(3)} + H_{2B}^{(5)}, where
lim_{γ_0 → 0} H_{2B}^{(3)} = (1/4) ∫_ℝ v_J(y|θ*) v_J^⊤(y|θ*) t_J(y|θ*) [ g_B^{*1/2}(y) − t_J(y|θ*) ] dy.
First consider H_{2B}^{(1)}. It converges to zero as γ_0 → 0 by the Cauchy-Schwarz inequality and assumption (O1). Next consider H_{2B}^{(3)}. By a similar argument as above and the boundedness of v_J(y|θ*) v_J^⊤(y|θ*) t_J(y|θ*), it follows that (27) holds. Next consider H_{2B}^{(4)}. It converges to zero as γ_0 → 0 by the Cauchy-Schwarz inequality and assumption (M3’). Now letting B → ∞, we will show that lim_{B → ∞} H_{2B}^{(2)} = 0 and lim_{B → ∞} lim_{γ_0 → 0} H_{2B}^{(3)} = 0. First consider H_{2B}^{(2)}. It converges to zero by the Cauchy-Schwarz inequality and assumption (M3’) as B → ∞. Finally consider lim_{B → ∞} lim_{γ_0 → 0} H_{2B}^{(3)}. It converges to zero by the Cauchy-Schwarz inequality and the L_1 convergence of g_B*(·) to h*_J(·|θ*). Thus lim_{B → ∞} lim_{γ_0 → 0} D_{2B}(θ̃_{i,B}(γ_0)) = (1/4) I(θ*). Now letting B → ∞, the proof of (23) follows using arguments similar to those in Lemma 2. □
Lemma 6.
Assume that the conditions (B1)–(B2), (D1’)–(D2’), (M1’)–(M3’), (M7’), (O1)–(O2), and (V1)–(V2) hold. Then, first letting γ_0 → 0 and then letting B → ∞,
4 B^{1/2} T_B(γ_0) →_d N( 0, I(θ*) ).
Proof. 
First fix B. We will show that, as γ_0 → 0,
4 B^{1/2} T_B(γ_0) →_d B^{1/2} ∫_ℝ v_J(y|θ*) g_B*(y) dy.
First observe that
4 B^{1/2} T_B(γ_0) − B^{1/2} ∫_ℝ v_J(y|θ*) g_B*(y) dy = B^{1/2} ∫_ℝ [ u_J(y|θ*, γ_0) − v_J(y|θ*) ] g_B^{(i)}(y) dy
+ B^{1/2} ∫_ℝ v_J(y|θ*) [ g_B^{(i)}(y) − g_B*(y) ] dy.
We will show that the RHS of (29) converges to zero as γ_0 → 0 and the RHS of (30) converges to zero in probability as γ_0 → 0. First consider the RHS of (29). Since
| ∫_ℝ [ u_J(y|θ*, γ_0) − v_J(y|θ*) ] g_B^{(i)}(y) dy | ≤ ∫_ℝ sup_y | u_J(y|θ*, γ_0) − v_J(y|θ*) | g_B^{(i)}(y) dy,
it converges to zero as γ_0 → 0 by assumption (V1). Next consider the RHS of (30). Since
∫_ℝ v_J(y|θ*) [ g_B^{(i)}(y) − g_B*(y) ] dy = (1/B) Σ_{l=1}^{B} ∫_ℝ [ v_J(Y_il + u c_B|θ*) − v_J(Y_il* + u c_B|θ*) ] K(u) du,
by assumption (V2) it follows that, as γ_0 → 0, (30) converges to zero in probability.
Now letting B → ∞, we have
B^{1/2} ∫_ℝ v_J(y|θ*) g_B*(y) dy − B^{1/2} (1/B) Σ_{l=1}^{B} v_J(Y_il*|θ*) = B^{1/2} (1/B) Σ_{l=1}^{B} ∫_ℝ [ v_J(Y_il* + c_B t|θ*) − v_J(Y_il*|θ*) ] K(t) dt,
and
E‖ B^{1/2} ∫_ℝ v_J(y|θ*) g_B*(y) dy − B^{1/2} (1/B) Σ_{l=1}^{B} v_J(Y_il*|θ*) ‖² = E‖ B^{1/2} (1/B) Σ_{l=1}^{B} ∫_ℝ [ v_J(Y_il* + c_B t|θ*) − v_J(Y_il*|θ*) ] K(t) dt ‖² ≤ C E ∫_ℝ ‖ v_J(Y_{i1}* + c_B t|θ*) − v_J(Y_{i1}*|θ*) ‖² dt = C ∫_ℝ ∫_ℝ ‖ v_J(y + c_B t|θ*) − v_J(y|θ*) ‖² h*_J(y|θ*) dy dt → 0 as B → ∞,
where the last convergence follows by assumption (M7’). Hence, using the central limit theorem for independent and identically distributed random variables, it follows that the limiting distribution of B^{1/2} ∫_ℝ v_J(y|θ*) g_B*(y) dy is N( 0, I(θ*) ), proving the lemma. □
Proof of Proposition 4.
The proof of Proposition 4 follows by combining Lemmas 5 and 6. □
Lemma 7.
Assume that the conditions (M1’)–(M6’) and (V1)–(V4) hold. Then,
lim_{B → ∞} lim_{γ_0 → 0} A_{2B}(γ_0) = 0 in probability.
Proof. 
First fix B. Let
H_B(γ_0) = ∫_ℝ u_J(y|θ*, γ_0) [ h_J^{1/2}(y|θ*, γ_0) − g_B^{(i)1/2}(y) ]² dy − ∫_ℝ v_J(y|θ*) [ t_J(y|θ*) − g_B^{*1/2}(y) ]² dy.
We will show that, as γ_0 → 0, H_B(γ_0) → 0. By algebra, H_B(γ_0) can be written as the sum of H_{1B}(γ_0) and H_{2B}(γ_0), where
H_{1B}(γ_0) = ∫_ℝ [ u_J(y|θ*, γ_0) − v_J(y|θ*) ] [ h_J^{1/2}(y|θ*, γ_0) − g_B^{(i)1/2}(y) ]² dy, and
H_{2B}(γ_0) = ∫_ℝ v_J(y|θ*) [ h_J^{1/2}(y|θ*, γ_0) − g_B^{(i)1/2}(y) ]² dy.
First consider H_{1B}(γ_0). It is bounded above by C sup_y | u_J(y|θ*, γ_0) − v_J(y|θ*) |, which converges to zero as γ_0 → 0 by assumption (V1), where C is a constant. Next consider H_{2B}(γ_0). We will show that H_{2B}(γ_0) converges to
∫_ℝ v_J(y|θ*) [ t_J(y|θ*) − g_B^{*1/2}(y) ]² dy.
In fact, the difference of H_{2B}(γ_0) and the above formula can be expressed as the sum of H_{2B}^{(1)}(γ_0), H_{2B}^{(2)}(γ_0), and H_{2B}^{(3)}(γ_0), where
H_{2B}^{(1)}(γ_0) = ∫_ℝ v_J(y|θ*) [ h_J(y|θ*, γ_0) − h*_J(y|θ*) ] dy,
H_{2B}^{(2)}(γ_0) = ∫_ℝ v_J(y|θ*) [ g_B^{(i)}(y) − g_B*(y) ] dy, and
H_{2B}^{(3)}(γ_0) = ∫_ℝ v_J(y|θ*) [ h_J^{1/2}(y|θ*, γ_0) g_B^{(i)1/2}(y) − t_J(y|θ*) g_B^{*1/2}(y) ] dy.
First consider H_{2B}^{(1)}(γ_0). Please note that
| H_{2B}^{(1)}(γ_0) | ≤ ∫_ℝ | h*_J(y|θ*) | | h_J(y|θ*, γ_0)/h*_J(y|θ*) − 1 | dy ≤ [ sup_y ( s_J(y|θ, γ_0)/t_J(y|θ) − 1 )² + 2 sup_y | s_J(y|θ, γ_0)/t_J(y|θ) − 1 | ] ∫_ℝ | h*_J(y|θ*) | dy,
which converges to 0 as γ_0 → 0 by assumptions (V3) and (V4). Next we consider H_{2B}^{(2)}(γ_0). Since
H_{2B}^{(2)}(γ_0) = (1/B) Σ_{l=1}^{B} ∫_ℝ [ v_J(Y_il + u c_B|θ*) − v_J(Y_il* + u c_B|θ*) ] K(u) du,
it converges to zero as γ_0 → 0 by assumption (V2). Finally, consider H_{2B}^{(3)}(γ_0), which can be expressed as the sum of L_{1B}(γ_0) and L_{2B}, where
L 1 B ( γ 0 ) = I R v J ( y | θ * ) h J 1 2 ( y | θ * , γ 0 ) t J ( y | θ * ) g B ( i ) 1 2 ( y ) d y , and
L 2 B = I R v J ( y | θ * ) t J ( y | θ * ) g B ( i ) 1 2 ( y ) g B * 1 2 ( y ) d y .
First consider L 1 B ( γ 0 ) . Notice that
L 1 B ( γ 0 ) sup y s J ( y | θ , γ 0 ) t J ( y | θ ) 1 I R v J ( y | θ * ) t J ( y | θ * ) g B ( i ) 1 2 ( y ) d y 0 ,
where the last convergence follows by Cauchy-Schwarz inequality and assumption (V4). Next we consider L 2 B . By Cauchy-Schwarz inequality, it is bounded above by
I R v J ( y | θ * ) v J ( y | θ * ) h * J ( y | θ * ) d y 1 2 I R g B ( i ) 1 2 ( y ) g B * 1 2 ( y ) 2 d y 1 2 .
Equation (31) converges to zero as γ 0 0 by the boundedness of I R v J ( y | θ * ) v J ( y | θ * ) h * J ( y | θ * ) d y and the L 1 convergence of g B ( i ) ( · ) to g B * ( · ) , which has already been established in Lemma 5. Now letting B , the lemma follows by an argument similar to that of Lemma 4, using assumptions (M1’)–(M6’). □
Proof of Theorem 2.
Recall that
B 1 2 ( θ ^ i , B ( γ 0 ) θ * ( γ 0 ) ) = A 1 B ( γ 0 ) + A 2 B ( γ 0 ) .
Proposition 4 shows that first letting γ 0 0 , then B , A 1 B ( γ 0 ) d N ( 0 , I 1 ( θ * ) ) ; while Lemma 7 shows that lim B lim γ 0 0 A 2 B ( γ 0 ) = 0 in probability. The theorem follows from Slutsky’s theorem. □
Remark 3.
The above two theorems (Theorems 1 and 2) do not immediately imply that the double limit exists. Establishing this requires stronger conditions and more delicate calculations, and will be considered elsewhere.

3.5. Robustness of MHDE

In this section, we describe the robustness properties of the MHDE for compressed data. Accordingly, let h J , α , z ( · | θ , γ 0 ) ≡ ( 1 − α ) h J ( · | θ , γ 0 ) + α η z , where η z denotes the uniform density on the interval ( z − ϵ , z + ϵ ) with ϵ > 0 small, θ ∈ Θ , α ∈ ( 0 , 1 ) , and z ∈ I R . Also, let s J , α , z ( y | θ , γ 0 ) = h J , α , z 1 2 ( y | θ , γ 0 ) , u J , α , z ( y | θ , γ 0 ) = log h J , α , z ( y | θ , γ 0 ) , h α , z * J ( · | θ ) ≡ ( 1 − α ) h * J ( · | θ ) + α η z , s α , z * J ( · | θ ) = h α , z * J 1 2 ( · | θ ) , and u α , z * J = log h α , z * J ( · | θ ) . Before we state the theorem, we describe certain additional assumptions, which are essentially L 2 continuity conditions, that are needed in the proof.
Model assumptions for robustness analysis
(O3) For α [ 0 , 1 ] and all θ Θ ,
lim γ 0 0 I R u ˙ J , α , z ( y | θ , γ 0 ) s J , α , z ( y | θ , γ 0 ) u ˙ α , z * J ( y | θ ) s α , z * J ( y | θ ) 2 d y = 0 .
(O4) For α [ 0 , 1 ] and all θ Θ ,
lim γ 0 0 I R u J , α , z ( y | θ , γ 0 ) u J , α , z ( y | θ , γ 0 ) s J , α , z ( y | θ , γ 0 ) u α , z * J ( y | θ ) u α , z * J ( y | θ ) s α , z * J ( y | θ ) 2 d y = 0 .
Theorem 3.
(i) Let α ∈ ( 0 , 1 ) . Assume that, for all θ ∈ Θ , the assumptions of Proposition 1 hold and that T ( h J , α , z ( · | θ , γ 0 ) ) is unique for all z. Then, T ( h J , α , z ( · | θ , γ 0 ) ) is a bounded, continuous function of z and
lim γ 0 0 lim | z | T ( h J , α , z ( · | θ , γ 0 ) ) = θ ;
(ii) Assume further that the conditions (V1), (M2)-(M3), and (O3)-(O4) hold. Then,
lim γ 0 0 lim α 0 α 1 T ( h J , α , z ( · | θ , γ 0 ) ) θ = I ( θ ) 1 I R η z ( y ) v J ( y | θ ) d y ,
Proof. 
Let θ z ( γ 0 ) denote T ( h J , α , z ( · | θ , γ 0 ) ) and let θ z denote T ( h α , z * J ( · | θ ) ) . We first show that (32) holds. Let γ 0 ≠ 0 be fixed. Then, by the triangle inequality,
lim | z | | θ z ( γ 0 ) θ | lim | z | | θ z ( γ 0 ) θ ( γ 0 ) | + lim | z | | θ ( γ 0 ) θ | .
We will show that the first term on the RHS of (33) is equal to zero. Suppose that it is not zero. Then, without loss of generality, by going to a subsequence if necessary, we may assume that θ z → θ 1 ≠ θ as | z | → ∞ . Since θ z ( γ 0 ) minimizes the Hellinger distance between h J , α , z ( · | θ , γ 0 ) and the model densities h J ( · | t , γ 0 ) , t ∈ Θ , it follows that
H D 2 ( h J , α , z ( · | θ , γ 0 ) , h J ( · | θ z , γ 0 ) ) H D 2 ( h J , α , z ( · | θ , γ 0 ) , h J ( · | θ , γ 0 ) )
for every θ Θ . We now show that as | z | ,
H D 2 ( h J , α , z ( · | θ , γ 0 ) , h J ( · | θ z , γ 0 ) ) H D 2 ( ( 1 α ) h J ( · | θ , γ 0 ) , h J ( · | θ 1 , γ 0 ) ) .
To this end, note that as | z | , for every y,
h J , α , z ( y | θ , γ 0 ) ( 1 α ) h J ( y | θ , γ 0 ) , and h J ( y | θ z , γ 0 ) h J ( y | θ 1 , γ 0 )
Therefore, as | z | ,
H D 2 ( h J , α , z ( · | θ , γ 0 ) , h J ( · | θ z , γ 0 ) ) H D 2 ( ( 1 α ) h J ( · | θ , γ 0 ) , h J ( · | θ 1 , γ 0 ) ) 2 ( Q 1 + Q 2 ) ,
where
Q 1 = I R h J , α , z 1 2 ( y | θ , γ 0 ) ( 1 α ) h J ( y | θ , γ 0 ) 1 2 h J ( y | θ z , γ 0 ) ) 1 2 d y ,
Q 2 = I R h J 1 2 ( y | θ z , γ 0 ) h J ( y | θ 1 , γ 0 ) 1 2 ( 1 α ) h J ( y | θ , γ 0 ) 1 2 d y .
Now, by Cauchy-Schwarz inequality and Scheffe’s theorem, it follows that as | z | , Q 1 0 and Q 2 0 . Therefore, (35) holds. By Equations (34) and (35), we have
H D 2 ( ( 1 α ) h J ( · | θ , γ 0 ) , h J ( · | θ 1 , γ 0 ) ) H D 2 ( ( 1 α ) h J ( · | θ , γ 0 ) , h J ( · | θ , γ 0 ) )
for every θ Θ . Now consider
H I F ( α , h J ( · | θ , γ 0 ) , h J ( · | θ , γ 0 ) ) I R ( 1 α ) δ ( h J ( · | θ , γ 0 ) , h J ( y | θ , γ 0 ) ) + 1 1 2 1 2 h J ( y | θ , γ 0 ) d y ,
where δ ( h J ( · | θ , γ 0 ) , h J ( y | θ , γ 0 ) ) = h J ( y | θ , γ 0 ) h J ( y | θ , γ 0 ) 1 . Note that G * ( δ ) = ( 1 α ) δ + 1 1 2 1 2 is a non-negative and strictly convex function with δ = 0 as its unique point of minimum. Hence H I F ( α , h J ( · | θ , γ 0 ) , h J ( · | θ , γ 0 ) ) > 0 unless δ ( h J ( · | θ , γ 0 ) , h J ( y | θ , γ 0 ) ) = 0 except on a set of Lebesgue measure zero, which, by the model identifiability assumption, is true if and only if θ = θ . Since θ 1 ≠ θ , it follows that
H I F ( α , h J ( · | θ , γ 0 ) , h J ( · | θ 1 , γ 0 ) ) > H I F ( α , h J ( · | θ , γ 0 ) , h J ( · | θ , γ 0 ) ) .
Note that H I F ( α , h J ( · | θ , γ 0 ) , h J ( · | θ , γ 0 ) ) = H D 2 ( ( 1 α ) h J ( · | θ , γ 0 ) , h J ( · | θ , γ 0 ) ) α . This implies that
H D 2 ( ( 1 α ) h J ( · | θ , γ 0 ) , h J ( · | θ 1 , γ 0 ) ) > H D 2 ( ( 1 α ) h J ( · | θ , γ 0 ) , h J ( · | θ , γ 0 ) ) ,
which contradicts (36). The continuity of θ z follows from Proposition 2 and the boundedness follows from the compactness of Θ . Now letting γ 0 0 , the second term on the RHS of (33) converges to zero by Proposition 2.
We now turn to part (ii) of the theorem. First fix γ 0 ≠ 0 . Since θ z minimizes H D 2 ( h J , α , z ( · | θ , γ 0 ) , h J ( · | t , γ 0 ) ) over t ∈ Θ , a Taylor expansion of H D 2 ( h J , α , z ( · | θ , γ 0 ) , h J ( · | θ z , γ 0 ) ) around θ gives
0 = H D 2 ( h J , α , z ( · | θ , γ 0 ) , h J ( · | θ z , γ 0 ) ) = H D 2 ( h J , α , z ( · | θ , γ 0 ) , h J ( · | θ , γ 0 ) ) + ( θ z θ ) D ( h J , α , z ( · | θ , γ 0 ) , h J ( · | θ z * , γ 0 ) ) ,
where θ z * ( γ 0 ) is a point between θ and θ z ,
H D 2 ( h J , α , z ( · | θ , γ 0 ) , h J ( · | θ , γ 0 ) ) = 1 2 I R u J , α , z ( y | θ , γ 0 ) s J , α , z ( y | θ , γ 0 ) s J , α , z ( y | θ , γ 0 ) s J ( y | θ , γ 0 ) d y ,
and D ( h J , α , z ( · | θ , γ 0 ) , h J ( · | θ , γ 0 ) ) can be expressed as the sum of D 1 ( h J , α , z ( · | θ , γ 0 ) , h J ( · | θ , γ 0 ) ) and D 2 ( h J , α , z ( · | θ , γ 0 ) , h J ( · | θ , γ 0 ) ) , where
D 1 ( h J , α , z ( · | θ , γ 0 ) , h J ( · | θ , γ 0 ) ) = 1 2 I R u ˙ J , α , z ( y | θ , γ 0 ) s J , α , z ( y | θ , γ 0 ) s J ( y | θ , γ 0 ) d y and
D 2 ( h J , α , z ( · | θ , γ 0 ) , h J ( · | θ , γ 0 ) ) = 1 4 I R u J , α , z ( y | θ , γ 0 ) u J , α , z ( y | θ , γ 0 ) s J , α , z ( y | θ , γ 0 ) s J ( y | θ , γ 0 ) d y .
Therefore,
α 1 θ z θ = α 1 D 1 ( h J , α , z ( · | θ , γ 0 ) , h J ( · | θ z * , γ 0 ) ) H D 2 ( h J , α , z ( · | θ , γ 0 ) , h J ( · | θ , γ 0 ) ) .
We will show that
lim α 0 D ( h J , α , z ( · | θ , γ 0 ) , h J ( · | θ z * , γ 0 ) ) = 1 4 I ( θ ( γ 0 ) ) , and
lim α 0 α 1 H D 2 ( h J , α , z ( · | θ , γ 0 ) , h J ( · | θ , γ 0 ) ) = 1 4 I R η z ( y ) u J ( y | θ , γ 0 ) d y .
We will first establish (40). Please note that as α 0 , by definition θ z ( α ) θ . Thus, lim α 0 θ z * ( α ) = θ . In addition, by assumptions (O3) and (O4), D ( h J , α , z ( · | θ , γ 0 ) , h J ( · | θ z , γ 0 ) ) is continuous in θ z . Therefore, to prove (40), it suffices to show that
lim α 0 D 1 ( h J , α , z ( · | θ , γ 0 ) , h J ( · | θ , γ 0 ) ) = 1 2 I R u ˙ J , α , z ( y | θ , γ 0 ) h J ( y | θ , γ 0 ) d y = 1 2 I ( θ ( γ 0 ) ) , and
lim α 0 D 2 ( h J , α , z ( · | θ , γ 0 ) , h J ( · | θ , γ 0 ) ) = 1 4 I R u J , α , z ( y | θ , γ 0 ) u J , α , z ( y | θ , γ 0 ) h J ( y | θ , γ 0 ) d y = 1 4 I ( θ ( γ 0 ) ) .
We begin with D 2 ( h J , α , z ( · | θ , γ 0 ) , h J ( · | θ , γ 0 ) ) . Notice that
lim α 0 s J , α , z ( y | θ , γ 0 ) = s J ( y | θ , γ 0 ) , lim α 0 u J , α , z ( y | θ , γ 0 ) = u J ( y | θ , γ 0 ) , and
lim α 0 u ˙ J , α , z ( y | θ , γ 0 ) = u ˙ J ( y | θ , γ 0 ) .
Thus,
lim α 0 u ˙ J , α , z ( y | θ , γ 0 ) s J , α , z ( y | θ , γ 0 ) s J ( y | θ , γ 0 ) = u ˙ J ( y | θ , γ 0 ) h J ( y | θ , γ 0 ) .
In addition, in order to pass the limit inside the integral, note that, for every component of matrix u J , α , z ( · | θ , γ 0 ) u J , α , z ( · | θ , γ 0 ) , we have
u J , α , z ( y | θ , γ 0 ) u J , α , z ( y | θ , γ 0 ) = ( 1 α ) h J , α , z ( y | θ , γ 0 ) ( 1 α ) h J , α , z ( y | θ , γ 0 ) + α η z ( y ) ( 1 α ) h J , α , z ( y | θ , γ 0 ) ( 1 α ) h J , α , z ( y | θ , γ 0 ) + α η z ( y ) = h J , α , z ( y | θ , γ 0 ) h J , α , z ( y | θ , γ 0 ) + α 1 α η z ( y ) h J , α , z ( y | θ , γ 0 ) h J , α , z ( y | θ , γ 0 ) + α 1 α η z ( y ) h J , α , z ( y | θ , γ 0 ) h J , α , z ( y | θ , γ 0 ) h J , α , z ( y | θ , γ 0 ) h J , α , z ( y | θ , γ 0 ) = u J ( y | θ , γ 0 ) u J ( y | θ , γ 0 ) ,
where | · | represents the absolute function for each component of the matrix, and
s J , α , z ( y | θ , γ 0 ) h J ( y | θ , γ 0 ) + η z ( y ) 1 2 .
Now choosing the dominating function
m J ( 1 ) ( y | θ , γ 0 ) = u J ( y | θ , γ 0 ) u J ( y | θ , γ 0 ) h J ( y | θ , γ 0 ) + η z ( y ) 1 2 s J ( y | θ , γ 0 )
and applying Cauchy-Schwarz inequality, we obtain that there exists a constant C such that
I R m J ( 1 ) ( y | θ , γ 0 ) d y C I R u J ( y | θ , γ 0 ) u J ( y | θ , γ 0 ) s J ( y | θ , γ 0 ) 2 d y 1 2 ,
which is finite by assumption (M2). Hence, by the dominated convergence theorem, (43) holds. Turning to (42), notice that for each component of the matrix u ˙ J , α , z ( y | θ , γ 0 ) ,
u ˙ J , α , z ( y | θ , γ 0 ) = h ¨ J ( y | θ , γ 0 ) h J ( y | θ , γ 0 ) + α 1 α η z ( y ) h J ( y | θ , γ 0 ) h J ( y | θ , γ 0 ) h J ( y | θ , γ 0 ) + α 1 α η z ( y ) 2 h ¨ J ( y | θ , γ 0 ) h J ( y | θ , γ 0 ) + h J ( y | θ , γ 0 ) h J ( y | θ , γ 0 ) h J 2 ( y | θ , γ 0 ) ,
where | · | denotes the absolute function for each component. Now choosing the dominating function
m J ( 2 ) ( y | θ , γ 0 ) = h ¨ J ( y | θ , γ 0 ) h J ( y | θ , γ 0 ) + h J ( y | θ , γ 0 ) h J ( y | θ , γ 0 ) h J 2 ( y | θ , γ 0 ) h J ( y | θ , γ 0 ) + η z ( y ) 1 2 s J ( y | θ , γ 0 ) ,
and applying the Cauchy-Schwarz inequality it follows, using (M3), that
I R m J ( 2 ) ( y | θ , γ 0 ) d y < .
Finally, by the dominated convergence theorem, it follows that
lim α 0 D 1 ( h J , α , z ( · | θ , γ 0 ) , h J ( · | θ z * , γ 0 ) ) = 1 2 I ( θ ( γ 0 ) ) .
Therefore (40) follows. It remains to show that (41) holds. To this end, note that
H D 2 ( h J , α , z ( · | θ , γ 0 ) , h J ( · | θ , γ 0 ) ) = 1 2 I R s J , α , z ( y | θ , γ 0 ) u J , α , z ( y | θ , γ 0 ) s J ( y | θ , γ 0 ) d y .
The partial derivative of H D 2 ( h J , α , z ( · | θ , γ 0 ) , h J ( · | θ , γ 0 ) ) with respect to α can be expressed as the sum of U 1 , U 2 and U 3 , where
U 1 = 1 4 I R h J ( y | θ , γ 0 ) + η z ( y ) s J , α , z ( y | θ , γ 0 ) u J , α , z ( y | θ , γ 0 ) s J ( y | θ , γ 0 ) d y ,
U 2 = 1 2 I R s J , α , z ( y | θ , γ 0 ) h J ( y | θ , γ 0 ) h J , α , z ( y | θ , γ 0 ) h J , α , z 2 ( y | θ , γ 0 ) s J ( y | θ , γ 0 ) d y , and
U 3 = 1 2 I R s J , α , z ( y | θ , γ 0 ) ( 1 α ) h J ( y | θ , γ 0 ) ( h J ( y | θ , γ 0 ) + η z ( y ) ) h J , α , z 2 ( y | θ , γ 0 ) s J ( y | θ , γ 0 ) d y .
By dominated convergence theorem (using similar idea as above to find dominating functions), we have
lim α 0 H D 2 ( h J , α , z ( · | θ , γ 0 ) , h J ( · | θ , γ 0 ) ) α = 1 4 I R u J ( y | θ , γ 0 ) η z ( y ) d y .
Hence, by L’Hospital’s rule, (41) holds. It remains to show that
lim γ 0 0 lim α 0 D ( h J , α , z ( · | θ , γ 0 ) , h J ( · | θ , γ 0 ) ) = 1 4 I ( θ ) , and
lim γ 0 0 lim α 0 α 1 H D 2 ( h J , α , z ( · | θ , γ 0 ) , h J ( · | θ , γ 0 ) ) = 1 4 I R η z ( y ) v J ( y | θ ) d y .
We start with (44). Since, for fixed γ 0 ≠ 0 , the above argument shows that
lim α 0 D ( h J , α , z ( · | θ , γ 0 ) , h J ( · | θ , γ 0 ) ) = 1 4 I ( θ ( γ 0 ) ) = 1 4 I R u J ( y | θ , γ 0 ) u J ( y | θ , γ 0 ) h J ( y | θ , γ 0 ) d y ,
it is enough to show
lim γ 0 0 I R u J ( y | θ , γ 0 ) u J ( y | θ , γ 0 ) h J ( y | θ , γ 0 ) d y = I R v J ( y | θ ) v J ( y | θ ) h * J ( y | θ ) d y ,
which is proved in Lemma 2. Hence (44) holds. Next we prove (45). By the argument used to establish (40), it is enough to show that
lim γ 0 0 I R η z ( y ) u J ( y | θ , γ 0 ) d y = I R η z ( y ) v J ( y | θ ) d y .
However,
I R η z ( y ) u J ( y | θ , γ 0 ) v J ( y | θ ) d y sup y u J ( y | θ , γ 0 ) v J ( y | θ ) ,
and the RHS of the above inequality converges to zero as γ 0 0 from assumption (V1). Hence (46) holds. This completes the proof. □
Our next result is concerned with the behavior of the α influence function when γ 0 → 0 first and then | z | → ∞ or α → 0 . The following three additional assumptions will be used in the proof of part (ii) of Theorem 4.
Model assumptions for robustness analysis
(O5) For α [ 0 , 1 ] and all θ Θ , u ˙ α , z * J ( y | θ ) s α , z * J ( y | θ ) is bounded in L 2 .
(O6) For α [ 0 , 1 ] and all θ Θ , u α , z * J ( y | θ ) u α , z * J ( y | θ ) s α , z * J ( y | θ ) is bounded in L 2 .
(O7) For α [ 0 , 1 ] and all θ Θ ,
lim γ 0 0 I R s J , α , z ( y | θ , γ 0 ) u J , α , z ( y | θ , γ 0 ) s α , z * J ( y | θ ) u α , z * J ( y | θ ) 2 d y = 0 .
Theorem 4.
(i) Let α ∈ ( 0 , 1 ) . Assume that, for all θ ∈ Θ , the assumptions of Proposition 1 hold and that T ( h J , α , z ( · | θ , γ 0 ) ) is unique for all z. Then, T ( h J , α , z ( · | θ , γ 0 ) ) is a bounded, continuous function of z such that
lim | z | lim γ 0 0 T ( h J , α , z ( · | θ , γ 0 ) ) = θ ;
(ii) Assume further that the conditions (O3)–(O7) hold. Then,
lim α 0 lim γ 0 0 α 1 T ( h J , α , z ( · | θ , γ 0 ) ) θ = I ( θ ) 1 I R η z ( y ) v J ( y | θ ) d y .
Proof. 
Let θ z ( γ 0 ) denote T ( h J , α , z ( · | θ , γ 0 ) ) and let θ z denote T ( h α , z * J ( · | θ ) ) . First fix z ∈ I R ; then, by the triangle inequality,
lim γ 0 0 | θ z ( γ 0 ) θ | lim γ 0 0 | θ z ( γ 0 ) θ z | + lim γ 0 0 | θ z θ | .
The first term on the RHS of (47) is equal to zero due to Proposition 2. Now letting | z | → ∞ , the second term on the RHS of (47) converges to zero by an argument similar to that of Theorem 3, with the density converging to h α , z * J ( · | θ ) . This completes the proof of part (i). Turning to (ii), we will prove that
lim α 0 lim γ 0 0 D ( h J , α , z ( · | θ , γ 0 ) , h J ( · | θ , γ 0 ) ) = 1 4 I ( θ ) ,
lim α 0 lim γ 0 0 α 1 H D 2 ( h J , α , z ( · | θ , γ 0 ) , h J ( · | θ , γ 0 ) ) = 1 4 I R η z ( y ) v J ( y | θ ) d y .
Recall from the proof of part (ii) of Theorem 3 that
D ( h J , α , z ( · | θ , γ 0 ) , h J ( · | θ , γ 0 ) ) = 1 2 I R u ˙ J , α , z ( y | θ , γ 0 ) s J , α , z ( y | θ , γ 0 ) s J ( y | θ , γ 0 ) + 1 4 I R u J , α , z ( y | θ , γ 0 ) u J , α , z ( y | θ , γ 0 ) s J , α , z ( y | θ , γ 0 ) s J ( y | θ , γ 0 ) D 1 ( h J , α , z ( · | θ , γ 0 ) , h J ( · | θ , γ 0 ) ) + D 2 ( h J , α , z ( · | θ , γ 0 ) , h J ( · | θ , γ 0 ) ) .
We will now show that for fixed α ( 0 , 1 )
lim γ 0 0 D 1 ( h J , α , z ( θ , γ 0 ) , h J ( · | θ , γ 0 ) ) = 1 2 I R u ˙ α , z * J ( y | θ ) s α , z * J ( y | θ ) t J ( y | θ ) d y , and
lim γ 0 0 D 2 ( h J , α , z ( θ , γ 0 ) , h J ( · | θ , γ 0 ) ) = 1 4 I R u α , z * J ( y | θ ) u α , z * J ( y | θ ) s α , z * J ( y | θ ) t J ( y | θ ) d y .
We begin with (50). A standard calculation shows that D 1 ( h J , α , z ( · | θ , γ 0 ) , h J ( · | θ , γ 0 ) ) can be expressed as the sum of D 11 , D 12 and D 13 , where
D 11 = 1 2 I R u ˙ J , α , z ( y | θ , γ 0 ) s J , α , z ( y | θ , γ 0 ) u ˙ α , z * J ( y | θ ) s α , z * J ( y | θ ) s J ( y | θ , γ 0 ) d y ,
D 12 = 1 2 I R u ˙ α , z * J ( y | θ ) s α , z * J ( y | θ ) s J ( y | θ , γ 0 ) t J ( y | θ ) d y , and
D 13 = 1 2 I R u ˙ α , z * J ( y | θ ) s α , z * J ( y | θ ) t J ( y | θ ) d y .
It can be seen that D 11 converges to zero as γ 0 0 by Cauchy-Schwarz inequality and assumption (O3); also, D 12 converges to zero as γ 0 0 by Cauchy-Schwarz inequality, assumption (O5) and Scheffe’s theorem. Hence (50) follows. Similarly (51) follows as γ 0 0 by Cauchy-Schwarz inequality, assumption (O4), assumption (O6) and Scheffe’s theorem.
Now let α 0 . Using the same idea as in Theorem 3 to find dominating functions, one can apply the dominated convergence Theorem to establish that
lim α 0 1 2 I R u ˙ α , z * J ( y | θ ) s α , z * J ( y | θ ) t J ( y | θ ) d y = 1 2 I ( θ ) , and
lim α 0 1 4 I R u α , z * J ( y | θ ) u α , z * J ( y | θ ) s α , z * J ( y | θ ) t J ( y | θ ) d y = 1 4 I ( θ ) .
Hence (48) follows. Finally, it remains to establish (49). First fix α ( 0 , 1 ) ; we will show that
lim γ 0 0 H D 2 ( h J , α , z ( · | θ , γ 0 ) , h J ( · | θ , γ 0 ) ) = 1 2 I R s α , z * J ( y | θ ) u α , z * J ( y | θ ) t J ( y | θ ) d y .
Please note that H D 2 ( h J , α , z ( · | θ , γ 0 ) , h J ( · | θ , γ 0 ) ) can be expressed as the sum of T 1 , T 2 and T 3 , where
T 1 = 1 2 I R s J , α , z ( y | θ , γ 0 ) u J , α , z ( y | θ , γ 0 ) s α , z * J ( y | θ ) u α , z * J ( y | θ ) s J ( y | θ , γ 0 ) d y ,
T 2 = 1 2 I R s α , z * J ( y | θ ) u α , z * J ( y | θ ) s J ( y | θ , γ 0 ) t J ( y | θ ) d y , and
T 3 = 1 2 I R s α , z * J ( y | θ ) u α , z * J ( y | θ ) t J ( y | θ ) d y .
It can be seen that T 1 converges to zero as γ 0 0 by the Cauchy-Schwarz inequality and assumption (O7); T 2 converges to zero as γ 0 0 by the Cauchy-Schwarz inequality, the boundedness of u α , z * J ( · ) s α , z * J ( · ) in L 2 , and Scheffe’s theorem. Therefore, (52) holds. Finally, letting α 0 and using the same idea as in Theorem 3 to find the dominating function, it follows by the dominated convergence theorem and L’Hospital’s rule that (49) holds. This completes the proof of the theorem. □
Remark 4.
Theorems 3 and 4 do not imply that the double limit exists. This is beyond the scope of this paper.
In the next section, we describe the implementation details and provide several simulation results in support of our methodology.

4. Implementation and Numerical Results

In this section, we apply the proposed MHD based methods to estimate the unknown parameters θ = ( μ , σ 2 ) using the compressed data. We set J = 10,000 and B = 100 . All simulations are based on 5000 replications. We consider the Gaussian kernel and Epanechnikov kernel for the nonparametric density estimation. The Gaussian kernel is given by
K ( x ) = 1 2 π exp x 2 2 ,
and the Epanechnikov kernel is given by
K ( x ) = 3 4 1 x 2 1 | x | 1 .
We generate X and the uncontaminated compressed data Y ˜ in the following way (a minimal simulation sketch is given after the list):
  • Step 1. Generate X l , where X j l i . i . d . N ( μ , σ 2 ) .
  • Step 2. Generate R l , where r i j l i . i . d . N ( 1 , γ 0 2 ) .
  • Step 3. Generate the uncontaminated Y ˜ l by calculating Y ˜ l = R l X l .
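A minimal sketch of Steps 1–3 in Python/NumPy follows. The function name and the decision to also store the row sums and row norms of each sensing matrix R l (the quantities used for the transformation in Section 4.1) are our own choices; in particular, reading ω i l as the Euclidean norm of the i-th row of R l is an assumption, not a prescription from the paper.

```python
import numpy as np

def generate_compressed_data(B=100, J=10_000, S=1, mu=2.0, sigma=1.0,
                             gamma0=0.1, seed=None):
    """Steps 1-3: for each of the B groups, draw J observations from
    N(mu, sigma^2), draw an S x J sensing matrix with N(1, gamma0^2)
    entries, and keep only the compressed S-vector together with the
    row sums and row norms of the sensing matrix."""
    rng = np.random.default_rng(seed)
    Y_tilde = np.empty((S, B))
    r_row_sum = np.empty((S, B))     # r_{i.l} = sum_j r_{ijl}
    r_row_norm = np.empty((S, B))    # our reading of omega_{il}: (sum_j r_{ijl}^2)^{1/2}
    for l in range(B):
        X_l = rng.normal(mu, sigma, size=J)           # Step 1
        R_l = rng.normal(1.0, gamma0, size=(S, J))    # Step 2
        Y_tilde[:, l] = R_l @ X_l                     # Step 3
        r_row_sum[:, l] = R_l.sum(axis=1)
        r_row_norm[:, l] = np.linalg.norm(R_l, axis=1)
    return Y_tilde, r_row_sum, r_row_norm
```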

4.1. Objective Function

In practice, we store the compressed data ( Y ˜ l , r · l , ω l ) for all 1 ≤ l ≤ B . Hence, if X j l follows a Normal distribution with mean μ and variance σ 2 , the marginal density of the compressed data, viz., Y i l , is complicated and does not have a closed form expression. However, for large J, using the local limit theorem its density can be approximated by a Gaussian density with mean J μ and variance σ 2 + γ 0 2 ( μ 2 + σ 2 ) . Hence, we work with U i l , where U i l = ( Y ˜ i l − μ r i · l ) / ω i l . Please note that with this transformation, E [ U i l ] = 0 and Var [ U i l ] = σ 2 . Hence, the kernel density estimate of the unknown true density is given by
g B ( i ) ( y | μ ) = ( 1 / ( B c B ) ) ∑ l = 1 B K ( ( y − U i l ) / c B ) .
The difference between the classical kernel density estimate and the one proposed here is that we include the unknown parameter μ in the kernel. Additionally, this allows one to incorporate ( r · l , ω l ) into the kernel. Consequently, only the scale parameter σ is part of the parametric model. Using the local limit theorem, we approximate the true parametric model by ϕ ( · | σ ) , where ϕ ( · | σ ) is the density of N ( 0 , σ 2 ) . Hence, the objective function is
Ψ ( i , θ ) ≡ A ( g B ( i ) ( · | μ ) , ϕ ( · | σ ) ) = I R g B ( i ) 1 / 2 ( y | μ ) ϕ 1 / 2 ( y | σ ) d y ;
and, the estimator is given by
θ ^ B ( γ 0 ) = ( 1 / S ) ∑ i = 1 S θ ^ i , B ( γ 0 ) , where θ ^ i , B ( γ 0 ) = argmax θ ∈ Θ Ψ ( i , θ ) .
It is clear that θ ^ B ( γ 0 ) is a consistent estimator of θ * . In the next subsection, we use the Quasi-Newton method with the Broyden-Fletcher-Goldfarb-Shanno (BFGS) update to estimate θ . The Quasi-Newton method is appealing since (i) it replaces the complicated calculation of the Hessian matrix with an approximation which is easier to compute ( Δ k ( θ ) given in the next subsection) and (ii) it allows a more flexible step size t (compared to the Newton-Raphson method), ensuring that it does not “jump” too far at every step and hence guaranteeing convergence of the estimating equation. The BFGS update ( H k ) is a popular method for approximating the Hessian matrix via gradient evaluations. The step size t is determined using the backtracking line search described in Algorithm 2. The algorithms are given in detail in the next subsection, and a compact sketch combining them is provided after Algorithm 2. Our analysis also includes the case where S = 1 and r i j l ≡ 1 . In this case, as explained previously, one obtains a significant reduction in storage and computational complexity. Finally, we emphasize here that the density estimate contains μ and is not parameter free as is typical in classical MHDE analysis. In the next subsection, we describe an algorithm to implement our method.
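To make the objective concrete, a minimal sketch of Ψ ( i , θ ) for the Gaussian kernel is given below. The integration grid, the trapezoidal quadrature, and the interpretation of ω i l as the row norm of R l are our own choices and are not prescribed by the paper.

```python
import numpy as np
from scipy.stats import norm

def affinity(theta, y_tilde_i, r_sum_i, omega_i, h_B,
             grid=np.linspace(-8.0, 8.0, 2001)):
    """Psi(i, theta): Hellinger affinity between the kernel density
    estimate g_B^(i)(.|mu), built from U_il = (Y~_il - mu r_{i.l}) / omega_il,
    and the parametric model N(0, sigma^2); h_B is the bandwidth."""
    mu, sigma = theta
    U = (y_tilde_i - mu * r_sum_i) / omega_i              # standardized compressed data
    # Gaussian-kernel density estimate evaluated on the grid
    g = norm.pdf((grid[:, None] - U[None, :]) / h_B).mean(axis=1) / h_B
    phi = norm.pdf(grid, scale=sigma)                     # parametric model N(0, sigma^2)
    return np.trapz(np.sqrt(g * phi), grid)               # affinity, to be maximized
```

The estimator θ ^ i , B ( γ 0 ) is then the maximizer of this affinity (equivalently, the minimizer of the Hellinger distance), which is what the quasi-Newton iteration of the next subsection computes.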

4.2. Algorithm

As explained previously, we use the Quasi-Newton Algorithm with BFGS update to obtain θ ^ MHDE . To describe this method, consider the objective function (suppressing i) Ψ ( θ ) , which is twice continuously differentiable. Let the initial value of θ be θ ( 0 ) = μ ( 0 ) , σ ( 0 ) and H 0 = I , where I is the identity matrix.
Algorithm 1: The Quasi-Newton Algorithm.
Set k = 1.
  • repeat
    • Calculate Δ_k(θ) = H_{k-1}^{-1} ∇Ψ(θ^{(k-1)}), where ∇Ψ(θ^{(k-1)}) is the first derivative of Ψ(θ) with respect to θ at the (k−1)th step.
    • Determine the step length parameter t via backtracking line search.
    • Compute θ^{(k)} = θ^{(k-1)} + t Δ_k(θ).
    • Compute H_k, where the BFGS update is
      H_k = H_{k-1} + (q_{k-1} q_{k-1}^T)/(q_{k-1}^T d_{k-1}) − (H_{k-1} d_{k-1} d_{k-1}^T H_{k-1}^T)/(d_{k-1}^T H_{k-1} d_{k-1}),
      where
      d_{k-1} = θ^{(k)} − θ^{(k-1)},  q_{k-1} = ∇Ψ(θ^{(k)}) − ∇Ψ(θ^{(k-1)}).
    • Compute e_k = | Ψ(θ^{(k)}) − Ψ(θ^{(k-1)}) |.
    • Set k = k + 1 .
  • until ( e k ) < threshold .
Remark 5.
In step 1, one can directly use the Inverse update for H k 1 as follows:
H_k^{-1} = ( I − (d_{k-1} q_{k-1}^T)/(q_{k-1}^T d_{k-1}) ) H_{k-1}^{-1} ( I − (q_{k-1} d_{k-1}^T)/(q_{k-1}^T d_{k-1}) ) + (d_{k-1} d_{k-1}^T)/(q_{k-1}^T d_{k-1}).
Remark 6.
In step 2, the step size t should satisfy the Wolfe conditions:
Ψ(θ^{(k)} + t Δ_k) ≤ Ψ(θ^{(k)}) + u_1 t ∇Ψ^T(θ^{(k)}) Δ_k ,    ∇Ψ^T(θ^{(k)} + t Δ_k) Δ_k ≥ u_2 ∇Ψ^T(θ^{(k)}) Δ_k ,
where u 1 and u 2 are constants with 0 < u 1 < u 2 < 1 . The first condition requires that t sufficiently decrease the objective function. The second condition ensures that the step size is not too small. The Backtracking line search algorithm proceeds as follows (see [26]):
Algorithm 2: The Backtracking Line Search Algorithm.
Given a descent direction Δ ( θ ) for Ψ at θ , constants ζ ∈ ( 0 , 0.5 ) and κ ∈ ( 0 , 1 ) . Set t := 1 .

    while Ψ ( θ + t Δ θ ) > Ψ ( θ ) + ζ t ∇Ψ ( θ ) T Δ θ ,
    do
         t : = κ t .
    end while
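The following is a compact, self-contained sketch that combines Algorithm 1 (with the inverse update of Remark 5) and the backtracking rule of Algorithm 2. The numerical gradient, stopping tolerance, and default constants ζ and κ are illustrative choices only; in practice one would apply the routine to the negative of the affinity Ψ so that a descent iteration maximizes the affinity.

```python
import numpy as np

def num_grad(f, x, eps=1e-6):
    """Central-difference gradient (the paper works with analytic derivatives)."""
    g = np.zeros_like(x)
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = eps
        g[j] = (f(x + e) - f(x - e)) / (2.0 * eps)
    return g

def quasi_newton_bfgs(f, x0, zeta=0.3, kappa=0.5, tol=1e-8, max_iter=200):
    """Minimize f by the quasi-Newton iteration of Algorithm 1, using the
    inverse BFGS update of Remark 5 and the backtracking rule of Algorithm 2."""
    x = np.asarray(x0, dtype=float)
    H_inv = np.eye(x.size)                         # H_0 = I
    g = num_grad(f, x)
    for _ in range(max_iter):
        delta = -H_inv @ g                         # search direction Delta_k
        t = 1.0
        while f(x + t * delta) > f(x) + zeta * t * (g @ delta):
            t *= kappa                             # shrink until sufficient decrease
        x_new = x + t * delta
        if abs(f(x_new) - f(x)) < tol:             # e_k below the threshold
            return x_new
        g_new = num_grad(f, x_new)
        d, q = x_new - x, g_new - g
        rho = 1.0 / (q @ d)
        I = np.eye(x.size)
        H_inv = (I - rho * np.outer(d, q)) @ H_inv @ (I - rho * np.outer(q, d)) \
                + rho * np.outer(d, d)             # inverse BFGS update (Remark 5)
        x, g = x_new, g_new
    return x
```

For example, with the affinity sketch of Section 4.1, θ ^ i , B ( γ 0 ) can be obtained as quasi_newton_bfgs applied to the map θ ↦ −Ψ ( i , θ ) , started from the initial value θ ( 0 ) of Section 4.3.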

4.3. Initial Values

The initial values for θ are taken to be
μ ( 0 ) = median ( Y ˜ i l ) / J , σ ( 0 ) = 1.48 × median | Y ˜ i l − median ( Y ˜ i l ) | / B .
Another choice of the initial value for σ is:
σ ^ ( 0 ) = ( Var [ Y ˜ i l ] ^ J γ 0 2 μ ) γ 0 2 + μ 0 2 ,
where Var [ Y ˜ i l ] ^ is an empirical estimate of the variance of Y ˜ 1 .
Bandwidth Selection: A key issue in implementing the above method of estimation is the choice of the bandwidth. We express the bandwidth in the form h B = c B s B , where c B ∈ { 0.3 , 0.4 , 0.5 , 0.7 , 0.9 } , and s B is set equal to 1.48 × median | Y ˜ i l − median ( Y ˜ i l ) | / B .
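These starting values and the robust scale s B can be computed as below. The divisor B in the scale estimate is taken verbatim from the displayed formulas (note that B = √J in the settings considered here), so that scaling should be read as our interpretation of the printed expression.

```python
import numpy as np

def initial_values(y_tilde, J, B):
    """Starting values of Section 4.3: mu(0) = median(Y~_il)/J and
    sigma(0) = 1.48 * MAD(Y~_il) / B, with 1.48 the usual Gaussian
    consistency factor for the median absolute deviation."""
    y = np.ravel(y_tilde)
    mad = np.median(np.abs(y - np.median(y)))
    return np.median(y) / J, 1.48 * mad / B

def bandwidth(y_tilde, B, c_B=0.3):
    """h_B = c_B * s_B, with s_B the same robust scale estimate."""
    y = np.ravel(y_tilde)
    s_B = 1.48 * np.median(np.abs(y - np.median(y))) / B
    return c_B * s_B
```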
In all the tables below, we report the average (Ave), standard deviation (StD) and mean square error (MSE) to assess the performance of the proposed methods.

4.4. Analyses Without Contamination

In Table 2, Table 3, Table 4 and Table 5, the true values are μ = 2 and σ = 1 , and the kernel is taken to be the Gaussian kernel. In Table 2, we compare the estimates of the parameters as the dimension of the compressed data S increases. In this table, we allow S to take values in the set { 1 , 2 , 5 , 10 } . Also, we let the number of groups be B = 100 , the bandwidth constant c B = 0.3 , and γ 0 = 0.1 . In addition, in Table 2, S * = 1 means that S = 1 with γ 0 = 0 .
From Table 2 we observe that as S increases, the estimates for μ and σ remain stable. The case S * = 1 is interesting, since even by storing the sum we are able to obtain point estimates which are close to the true value. In Table 3, we choose S = 1 , B = 100 and c B = 0.3 and compare the estimates as γ 0 changes from 0.01 to 1.00. We can see that as γ 0 increases, the estimate for μ remains stable, whereas the bias, standard deviation and MSE for σ increase.
In Table 4, we fix S = 1 , B = 100 and γ 0 = 0.1 and allow the bandwidth c B to increase. Also, c B * = 0.30 means that the bandwidth is chosen as 0.30 with γ 0 = 0 . Notice that in this case, when c B = 0.9 , B 1 / 2 c B = 9 while B 1 / 2 c B 2 = 8.1 , which is not small as is required in assumption (B2). We notice again that as c B decreases, the estimates of μ and σ are close to the true values with small MSE and StD.
In Table 5, we let S = 1 , c B = 0.3 and γ 0 = 0.1 and let the number of groups B increase. This table implies that as B increases, the estimate performs better in terms of bias, standard deviation and MSE.
In Table 6, we set γ 0 = 0 and keep the other settings the same as in Table 5. This table implies that as B increases, the estimate performs better in terms of bias, standard deviation and MSE. Furthermore, the standard deviation and MSE are slightly smaller than the results in Table 5.
We next move on to investigating the effect of other sensing variables. In the following table, we use a Gamma model to generate the sensing matrix R l . Specifically, the mean of the Gamma random variable is set as α 0 β 0 = 1 , and the variance α 0 β 0 2 is chosen from the set { 0 , 0.01 2 , 0.01 , 0.25 , 1.00 } , which are also the variances used in Table 3.
From Table 7, notice that using a Gamma sensing variable yields results similar to those obtained with a Gaussian sensing variable. Our next example considers the case when the mean of the sensing variable is not equal to one and the sensing variable is taken to have a discrete distribution. Specifically, we use Bernoulli sensing variables with parameter p. Moreover, we fix S = 1 and let p J = S ; therefore p = 1 / J . Hence, as J increases, the variance decreases. Now notice that in this case the mean of the sensing variable is p instead of 1. In addition, E [ Y ˜ i l ] = μ and Var [ Y ˜ i l ] = σ 2 + μ 2 ( 1 − 1 / J ) . Hence we set the initial values as
μ ( 0 ) = median ( Y ˜ i l ) , σ ( 0 ) = 1.48 × median | Y ˜ i l − median ( Y ˜ i l ) | .
Additionally, we take B = 100 , c B = 0.30 and s B to be 1.48 × median | Y ˜ i l − median ( Y ˜ i l ) | .
Table 8 shows that the MHD method also performs well with a Bernoulli sensing variable, although the bias of σ and the standard deviation and mean square error of both estimates are larger than those obtained using the Gaussian and Gamma sensing variables.

4.5. Robustness and Model Misspecification

In this section, we provide a numerical assessment of the robustness of the proposed methodology. To this end, let
f α , η ( x | θ ) = ( 1 − α ) f ( x | θ ) + α η ( x ) ,
where η ( x ) is a contaminating component and α ∈ [ 0 , 1 ) . We generate the contaminated reduced data Y in the following way (a short contamination sketch follows the list):
  • Step 1. Generate X l , where X j l i . i . d . N ( 2 , 1 ) .
  • Step 2. Generate R l , where r i j l i . i . d . N ( 1 , γ 0 2 ) .
  • Step 3. Generate uncontaminated Y ˜ l by calculating Y ˜ l = R l X l .
  • Step 4. Generate the contaminated Y ˜ i l c , where Y ˜ i l c = Y ˜ i l + η ( x ) with probability α , and Y ˜ i l c = Y ˜ i l with probability 1 − α .
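A short sketch of Step 4 is given below, under the assumption that the contamination simply adds a fixed outlier value η to a randomly selected fraction α of the compressed observations; the function name and interface are ours.

```python
import numpy as np

def contaminate(y_tilde, alpha=0.2, eta=1000.0, seed=None):
    """Step 4: each compressed observation is shifted by the outlier value
    eta independently with probability alpha, and left unchanged otherwise."""
    rng = np.random.default_rng(seed)
    mask = rng.random(np.shape(y_tilde)) < alpha
    return y_tilde + eta * mask
```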
In the above description, the contamination with outliers is within blocks. A conceptual issue that one encounters is the meaning of outliers in this setting. Specifically, a data point which is an outlier in the original data set may not remain an outlier in the reduced data, and vice-versa. Hence, concepts such as the breakdown point and the influence function need to be carefully studied. The tables below present one version of the robustness exhibited by the proposed method. In Table 9 and Table 10, we set J = 10 4 , B = 100 , S = 1 , γ 0 = 0.1 , c B = 0.3 , η = 1000 . In addition, α * = 0 means that α = 0 with γ 0 = 0 .
From the above tables we observe that even under 50 % contamination, the estimate of the mean remains stable; however, the estimate of the variance is affected at high levels of contamination (beyond 30 % ). An interesting and important issue is to investigate the role of γ 0 on the breakdown point of the estimator.
Finally, we investigate the bias in the MHDE as a function of the value of the outlier. The graphs below (Figure 2) describe the changes to the MHDE as the outlier value ( η ) increases. Here we set S = 1 , B = 100 , γ 0 = 0.1 . In addition, we let α = 0.2 , and η takes values in { 100 , 200 , 300 , 400 , 500 , 600 , 700 , 800 , 900 , 1000 } . We can see that as η increases, both μ ^ and σ ^ increase up to η = 500 and then decrease, although μ ^ does not change much. This phenomenon occurs because when the outlier value is small (i.e., close to the observations), it may not be treated as an “outlier” by the MHD method. However, once the outlier values move “far enough” from the other values, the estimates of μ and σ remain stable.

5. Example

In this section, we describe an analysis of data from financial analytics using the proposed methods. The data are from a bank (a cash and credit card issuer) in Taiwan, and the targets of the analyses were credit card holders of the bank. The research focused on the case of customers’ default payments. The data set (see [27] for details) contains 180,000 observations and includes information on twenty-five variables such as default payments, demographic factors, credit data, history of payment, and billing statements of credit card clients from April 2005 to September 2005. Ref. [28] studies machine learning methods for evaluating the probability of default. Here, we work with the first three months of data containing 90,000 observations concerning bill payments. For our analyses we remove zero and negative payments from the data set and perform a logarithmic transformation of the bill payments. Since the log-transformed data were multi-modal and exhibited features of a mixture of normal distributions, we work with the log-transformed data with values in the range (6.1, 13). Next, we applied the Box-Cox transformation to the log-transformed data. This transformation identifies the best power transformation that yields an approximately normal distribution (which belongs to the location-scale family). Specifically, let L denote the log-transformed data in the range (6.1, 13); then the data after the Box-Cox transformation are given by X = ( L 2 − 1 ) / 19.9091 . The histogram for X is given in Figure 3. The number of observations at the end of data processing was 70,000.
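A sketch of this preprocessing pipeline is given below. The cutoffs and the constant 19.9091 are those reported above, and the reading of the printed transformation as X = ( L 2 − 1 ) / 19.9091 is our own.

```python
import numpy as np

def preprocess_bill_payments(payments):
    """Section 5 data preparation: drop non-positive payments, take logs,
    keep log-values in (6.1, 13), and apply the Box-Cox style transformation
    X = (L^2 - 1) / 19.9091 toward approximate normality."""
    p = np.asarray(payments, dtype=float)
    L = np.log(p[p > 0.0])                 # remove zero and negative payments
    L = L[(L > 6.1) & (L < 13.0)]          # restrict to the stated range
    return (L ** 2 - 1.0) / 19.9091
```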
Our goal is to estimate the average bill payment during the first three months. For this, we will apply the proposed method. In this analysis, we assume that the target model for X is Gaussian and split the data, randomly, into B = 100 blocks yielding J = 700 observations per block.
In Table 11, “est” represents the estimator and “ 95 % CI” stands for the 95 % confidence interval for the estimator. When analyzing the whole data and choosing the bandwidth as c n = 0.30 , we get the MHDE of μ to be μ ^ = 5.183 with 95 % confidence interval ( 5.171 , 5.194 ) , and the MHDE of σ to be σ ^ = 1.425 with 95 % confidence interval ( 1.418 , 1.433 ) .
In Table 11, we choose the bandwidth as c B = 0.30 . Also, S * = 1 represents the case where S = 1 and γ 0 = 0 . In all other settings, we keep γ 0 = 0.1 . We observe that all estimates are similar as S changes.
Next we study the robustness of the MHDE for these data by investigating the relative bias and studying the influence function. Specifically, we first reduce the dimension from J = 700 to S = 1 for each of the B = 100 blocks and obtain the compressed data Y ˜ ; next, we generate the contaminated reduced data Y ˜ i l c following step 4 in Section 4.5. Also, we set α = 0.20 and γ 0 = 0.20 ; the kernel is taken to be the Epanechnikov density with bandwidth c B = 0.30 . η ( x ) is assumed to take values in { 50 , 100 , 200 , 300 , 500 , 800 , 1000 } (note that the approximate mean of Y ˜ is around 3600). Let T MHD be the Hellinger distance functional. The influence function is given by
IF ( α ; T , Y ˜ ) = [ T MHD ( Y ˜ c ) − T MHD ( Y ˜ ) ] / α ,
which we use to assess the robustness. The graphs shown below (Figure 4) illustrate how the influence function changes as the outlier value increases. We observe that for both estimates ( μ ^ and σ ^ ), the influence function first increases and then decreases rapidly. From η ( x ) = 300 onwards, the influence functions remain stable and are close to zero, which clearly indicates that the MHDE is stable.
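This empirical α-influence function can be computed directly from the two fits, as in the short sketch below; here T MHD is passed in as a generic callable returning ( μ ^ , σ ^ ), so the sketch does not presume any particular implementation of the estimator.

```python
import numpy as np

def empirical_influence(t_mhd, y_tilde, y_tilde_contaminated, alpha):
    """IF(alpha; T, Y~) = (T_MHD(Y~^c) - T_MHD(Y~)) / alpha, applied
    componentwise to the estimates of (mu, sigma)."""
    clean = np.asarray(t_mhd(y_tilde), dtype=float)
    contaminated = np.asarray(t_mhd(y_tilde_contaminated), dtype=float)
    return (contaminated - clean) / alpha
```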
Additional Analyses: The histogram in Figure 3 suggests that a mixture of normal distributions may fit the log- and Box-Cox-transformed data better than a single normal distribution. For this reason, we calculated the Hellinger distance between a four-component mixture (chosen using the BIC criterion) and the normal distribution; this was determined to be approximately 0.0237. Thus, the normal distribution (which belongs to the location-scale family) can be viewed as a misspecified target distribution; admittedly, one does lose information about the components of the mixture distribution due to model misspecification. However, since our goal was to estimate the overall mean and variance, the proposed estimate seems to possess the properties described in the manuscript.

6. Discussion and Extensions

The results in the manuscript focus on the iterated limit theory for the MHDE of compressed data obtained from a location-scale family. Two pertinent questions arise: (i) is it easy to extend this theory to the MHDE of compressed data arising from a non-location-scale family of distributions? and (ii) is it possible to extend the theory from iterated limits to a double limit? Turning to (i), we note that the heuristic for considering the location-scale family comes from the fact that the first and second moments are consistently estimable for partially observed random walks (see [29,30]). This is related to the size of J, which can be of exponential order. For such large J, other moments may not be consistently estimable. Hence, the entire theory goes through as long as one considers parametric models f ( · | θ ) , where θ = W ( μ , σ 2 ) for a known function W ( · , · ) . A case in point is the Gamma distribution, which can be re-parametrized in terms of its first two moments.
As for (ii), it is well-known that the existence and equality of iterated limits for real sequences do not imply the existence of the double limit unless additional uniformity of convergence holds (see [31] for instance). Extension of this notion to distributional convergence requires additional assumptions and is investigated in a different manuscript, wherein more general divergences are also considered.

7. Concluding Remarks

In this paper we proposed a Hellinger distance-based method to obtain robust estimates of the mean and variance in a location-scale model using compressed data. Our extensive theoretical investigations and simulations demonstrate the usefulness of the methodology, which can hence be applied in a variety of scientific settings. Several theoretical and practical questions concerning robustness in a big data setting arise. For instance, the variability in the R matrix and its effect on outliers are important issues that need further investigation. Furthermore, statistical properties such as uniform consistency and uniform asymptotic normality under different choices for the distribution of R would be useful. These are under investigation by the authors.

Author Contributions

The problem was conceived by E.A., A.N.V. and G.D. L.L. is a student of A.N.V., and worked on theoretical and simulation details with inputs from all members at different stages.

Funding

The authors thank George Mason University Libraries for support with the article processing fees; Ahmed’s research is supported by a grant from NSERC.

Acknowledgments

The authors thank the anonymous reviewers for a careful reading of the manuscript and several useful suggestions that improved the readability of the paper.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
MHDE: Minimum Hellinger Distance Estimator
MHD: Minimum Hellinger Distance
i.i.d.: independent and identically distributed
MLE: Maximum Likelihood Estimator
CI: Confidence Interval
IF: Influence Function
RHS: Right Hand Side
LHS: Left Hand Side
BFGS: Broyden-Fletcher-Goldfarb-Shanno
var: Variance
StD: Standard Deviation
MSE: Mean Square Error

References

  1. Beran, R. Minimum Hellinger distance estimates for parametric models. Ann. Stat. 1977, 5, 445–463.
  2. Lindsay, B.G. Efficiency versus robustness: The case for minimum Hellinger distance and related methods. Ann. Stat. 1994, 22, 1081–1114.
  3. Fisher, R.A. Two new properties of mathematical likelihood. Proc. R. Soc. Lond. Ser. A 1934, 144, 285–307.
  4. Pitman, E.J.G. The estimation of the location and scale parameters of a continuous population of any given form. Biometrika 1939, 30, 391–421.
  5. Gupta, A.; Székely, G. On location and scale maximum likelihood estimators. Proc. Am. Math. Soc. 1994, 120, 585–589.
  6. Duerinckx, M.; Ley, C.; Swan, Y. Maximum likelihood characterization of distributions. Bernoulli 2014, 20, 775–802.
  7. Teicher, H. Maximum likelihood characterization of distributions. Ann. Math. Stat. 1961, 32, 1214–1222.
  8. Thanei, G.A.; Heinze, C.; Meinshausen, N. Random projections for large-scale regression. In Big and Complex Data Analysis; Springer: Berlin, Germany, 2017; pp. 51–68.
  9. Slawski, M. Compressed least squares regression revisited. In Artificial Intelligence and Statistics; Addison-Wesley: Boston, MA, USA, 2017; pp. 1207–1215.
  10. Slawski, M. On principal components regression, random projections, and column subsampling. Electron. J. Stat. 2018, 12, 3673–3712.
  11. Raskutti, G.; Mahoney, M.W. A statistical perspective on randomized sketching for ordinary least-squares. J. Mach. Learn. Res. 2016, 17, 7508–7538.
  12. Ahfock, D.; Astle, W.J.; Richardson, S. Statistical properties of sketching algorithms. arXiv 2017, arXiv:1706.03665.
  13. Vidyashankar, A.; Hanlon, B.; Lei, L.; Doyle, L. Anonymized Data: Trade off between Efficiency and Privacy. 2018; preprint.
  14. Woodward, W.A.; Whitney, P.; Eslinger, P.W. Minimum Hellinger distance estimation of mixture proportions. J. Stat. Plan. Inference 1995, 48, 303–319.
  15. Basu, A.; Harris, I.R.; Basu, S. Minimum distance estimation: The approach using density-based distances. In Robust Inference, Handbook of Statistics; Elsevier: Amsterdam, The Netherlands, 1997; Volume 15, pp. 21–48.
  16. Hooker, G.; Vidyashankar, A.N. Bayesian model robustness via disparities. Test 2014, 23, 556–584.
  17. Sriram, T.; Vidyashankar, A. Minimum Hellinger distance estimation for supercritical Galton–Watson processes. Stat. Probab. Lett. 2000, 50, 331–342.
  18. Simpson, D.G. Minimum Hellinger distance estimation for the analysis of count data. J. Am. Stat. Assoc. 1987, 82, 802–807.
  19. Simpson, D.G. Hellinger deviance tests: Efficiency, breakdown points, and examples. J. Am. Stat. Assoc. 1989, 84, 107–113.
  20. Cheng, A.; Vidyashankar, A.N. Minimum Hellinger distance estimation for randomized play the winner design. J. Stat. Plan. Inference 2006, 136, 1875–1910.
  21. Basu, A.; Shioya, H.; Park, C. Statistical Inference: The Minimum Distance Approach; Chapman and Hall/CRC: London, UK, 2011.
  22. Bhandari, S.K.; Basu, A.; Sarkar, S. Robust inference in parametric models using the family of generalized negative exponential disparities. Aust. N. Z. J. Stat. 2006, 48, 95–114.
  23. Ghosh, A.; Harris, I.R.; Maji, A.; Basu, A.; Pardo, L. A generalized divergence for statistical inference. Bernoulli 2017, 23, 2746–2783.
  24. Tamura, R.N.; Boos, D.D. Minimum Hellinger distance estimation for multivariate location and covariance. J. Am. Stat. Assoc. 1986, 81, 223–229.
  25. Li, P. Estimators and tail bounds for dimension reduction in lα (0 < α ≤ 2) using stable random projections. In Proceedings of the Nineteenth Annual ACM-SIAM Symposium on Discrete Algorithms, San Francisco, CA, USA, 20–22 January 2008; Society for Industrial and Applied Mathematics: Philadelphia, PA, USA, 2008; pp. 10–19.
  26. Boyd, S.; Vandenberghe, L. Convex Optimization; Cambridge University Press: Cambridge, UK, 2004.
  27. Lichman, M. UCI Machine Learning Repository. Available online: https://archive.ics.uci.edu/ml/index.php (accessed on 29 March 2019).
  28. Yeh, I.C.; Lien, C.H. The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients. Expert Syst. Appl. 2009, 36, 2473–2480.
  29. Guttorp, P.; Lockhart, R.A. Estimation in sparsely sampled random walks. Stoch. Process. Appl. 1989, 31, 315–320.
  30. Guttorp, P.; Siegel, A.F. Consistent estimation in partially observed random walks. Ann. Stat. 1985, 13, 958–969.
  31. Apostol, T.M. Mathematical Analysis; Addison Wesley Publishing Company: Boston, MA, USA, 1974.
Figure 1. MLE vs. MHDE after Data Compression.
Figure 2. Comparison of estimates of μ (a) and σ (b) as outlier changes.
Figure 3. The histogram of credit payment data after Box-Cox transformation to Normality.
Figure 4. Influence Function of μ ^ (a) and σ ^ (b) for MHDE.
Table 1. Illustration of Data Reduction Mechanism, Here r i l * = ( r i · l , ω i l ) .
                Grp 1     Grp 2     …    Grp B                     Grp 1              Grp 2              …    Grp B
Original data   X_11      X_12      …    X_1B      Compressed      (Ỹ_11, r_11*)      (Ỹ_12, r_12*)      …    (Ỹ_1B, r_1B*)
                X_21      X_22      …    X_2B      data            (Ỹ_21, r_21*)      (Ỹ_22, r_22*)      …    (Ỹ_2B, r_2B*)
(J rows)        ⋮                                  (S rows)        ⋮
                X_J1      X_J2      …    X_JB                      (Ỹ_S1, r_S1*)      (Ỹ_S2, r_S2*)      …    (Ỹ_SB, r_SB*)
Table 2. MHDE as the dimension S changes for compressed data Y ˜ using Gaussian kernel.
            μ̂                                   σ̂
            Ave      StD × 10³   MSE × 10³      Ave      StD × 10³   MSE × 10³
S* = 1      2.000    1.010       0.001          1.016    74.03       5.722
S = 1       2.000    1.014       0.001          1.018    74.22       5.844
S = 2       2.000    1.005       0.001          1.019    73.81       5.832
S = 5       2.000    0.987       0.001          1.017    74.16       5.798
S = 10      2.000    0.995       0.001          1.019    71.87       5.525
Table 3. MHDE as γ 0 changes for compressed data Y ˜ using Gaussian kernel.
            μ̂                                   σ̂
            Ave      StD × 10³   MSE × 10³      Ave      StD × 10³   MSE × 10³
γ0 = 0.00   2.000    1.010       0.001          1.016    74.03       5.722
γ0 = 0.01   2.000    1.017       0.001          1.015    74.83       5.814
γ0 = 0.10   2.000    1.023       0.001          1.021    72.80       5.717
γ0 = 0.50   2.000    1.119       0.001          1.076    72.59       11.08
γ0 = 1.00   2.000    1.399       0.002          1.226    82.21       57.75
Table 4. MHDE as the bandwidth c B changes for compressed data Y ˜ using Gaussian kernel.
             μ̂                                   σ̂
             Ave      StD × 10³   MSE × 10³      Ave      StD × 10³   MSE × 10³
c_B* = 0.30  2.000    1.010       0.001          1.016    74.03       5.722
c_B = 0.30   2.000    1.014       0.001          1.018    74.22       5.844
c_B = 0.40   2.000    1.015       0.001          1.063    79.68       10.26
c_B = 0.50   2.000    1.014       0.001          1.108    82.33       18.33
c_B = 0.70   2.000    1.004       0.001          1.212    93.96       53.64
c_B = 0.90   2.000    1.009       0.001          1.346    110.5       132.2
Table 5. MHDE as B changes for compressed data Y ˜ using Gaussian kernel with γ 0 = 0.1 .
            μ̂                                   σ̂
            Ave      StD × 10³   MSE × 10³      Ave      StD × 10³   MSE × 10³
B = 20      2.000    2.205       0.005          1.739    378.5       688.6
B = 50      2.000    1.409       0.002          1.136    125.2       34.17
B = 100     2.000    1.010       0.001          1.016    74.03       5.722
B = 500     2.000    0.455       0.000          0.972    32.63       1.873
Table 6. MHDE as B changes for compressed data Y ˜ using Gaussian kernel with γ 0 = 0 .
            μ̂                                   σ̂
            Ave      StD × 10³   MSE × 10³      Ave      StD × 10³   MSE × 10³
B = 20      2.000    2.282       0.005          1.749    381.4       706.0
B = 50      2.000    1.440       0.002          1.148    125.2       37.42
B = 100     2.000    1.014       0.001          1.018    74.22       5.844
B = 500     2.000    0.465       0.000          0.973    31.33       1.692
Table 7. MHDE as variance changes for compressed data Y ˜ using Gaussian kernel under Gamma sensing variable.
             μ̂                                   σ̂
             Ave      StD × 10³   MSE × 10³      Ave      StD × 10³   MSE × 10³
var = 0.00   2.000    1.010       0.001          1.016    74.03       5.722
var = 0.01²  2.000    1.005       0.001          1.016    74.56       5.806
var = 0.01   2.000    1.006       0.001          1.018    73.70       5.762
var = 0.25   2.000    1.120       0.001          1.078    73.70       11.56
var = 1.00   2.000    1.438       0.001          1.228    81.94       58.48
Table 8. MHDE as J changes for compressed data Y ˜ using Gaussian kernel under Bernoulli sensing variable.
            μ̂                                   σ̂
            Ave      StD × 10³   MSE × 10³      Ave      StD × 10³   MSE × 10³
J = 10      2.000    104.9       11.01          1.215    97.78       55.79
J = 100     1.998    104.5       10.93          1.201    104.5       51.26
J = 1000    1.998    104.7       10.96          1.195    106.6       49.36
J = 5000    2.001    103.9       10.80          1.200    105.7       51.20
J = 10000   1.996    105.1       11.07          1.196    104.4       49.16
Table 9. MHDE as α changes for contaminated data Y ˜ using Gaussian kernel.
             μ̂                                   σ̂
             Ave      StD × 10³   MSE × 10³      Ave      StD × 10³   MSE × 10³
α* = 0.00    2.000    1.010       0.001          1.016    74.03       5.722
α = 0.00     2.000    1.014       0.001          1.018    74.22       5.844
α = 0.01     2.000    1.002       0.001          1.022    74.89       6.079
α = 0.05     2.000    1.053       0.001          1.023    77.86       6.599
α = 0.10     2.000    1.086       0.001          1.034    79.30       7.350
α = 0.20     2.000    1.146       0.001          1.073    93.45       14.06
α = 0.30     2.001    7.205       0.054          1.264    688.2       542.5
α = 0.40     2.026    21.60       1.100          3.454    1861        9480
α = 0.50     2.051    14.00       2.600          4.809    1005        15513
Table 10. MHDE as α changes for contaminated data Y ˜ using Epanechnikov kernel.
             μ̂                                   σ̂
             Ave      StD × 10³   MSE × 10³      Ave      StD × 10³   MSE × 10³
α* = 0.00    2.000    0.972       0.001          1.008    73.22       5.425
α = 0.00     2.000    1.014       0.001          1.018    74.22       5.844
α = 0.01     2.000    0.978       0.001          1.028    107.4       12.19
α = 0.05     2.000    1.264       0.002          1.025    108.7       12.35
α = 0.10     2.000    1.202       0.001          1.008    114.7       13.09
α = 0.20     2.000    1.263       0.002          1.046    129.8       18.76
α = 0.30     2.001    5.098       0.026          1.104    557.8       318.9
α = 0.40     2.021    21.80       0.900          3.004    1973        7870
α = 0.50     2.051    10.21       3.000          4.893    720.4       15669
Table 11. MHDE from the real data analysis.
                     μ̂                   σ̂
S* = 1   est         5.171               1.362
         95% CI      (4.904, 5.438)      (1.158, 1.540)
S = 1    est         5.171               1.391
         95% CI      (4.898, 5.443)      (1.183, 1.572)
S = 5    est         5.172               1.359
         95% CI      (4.905, 5.438)      (1.155, 1.535)
S = 10   est         5.171               1.372
         95% CI      (4.902, 5.440)      (1.167, 1.551)
S = 20   est         5.171               1.388
         95% CI      (4.899, 5.443)      (1.180, 1.569)
