Abstract
We consider the likelihood ratio test of a simple null hypothesis (with density ) against a simple alternative hypothesis (with density ) in the situation that observations are mismeasured due to the presence of measurement errors. Thus instead of for we observe with unobservable parameter and unobservable random variable . When we ignore the presence of measurement errors and perform the original test, the probability of type I error becomes different from the nominal value, but the test is still the most powerful among all tests on the modified level. Further, we derive the minimax test of some families of misspecified hypotheses and alternatives. The test exploits the concept of pseudo-capacities elaborated by Huber and Strassen (1973) and Buja (1986). A numerical experiment illustrates the principles and performance of the novel test.
1. Introduction
Measurement technologies are often affected by random errors; if the goal of the experiment is to compare two probability distributions using data, then the conclusion can be distorted if the data are affected by measurement errors. If the data are mismeasured, the statistical inference performed with them is biased and trends or associations in the data are deformed. This is common across a broad spectrum of applications, e.g., in engineering, physics, biomedicine, molecular genetics, chemometrics, and econometrics. Some observations can even go undetected, e.g., in measurements of magnetic or luminous flux in analytical chemistry when the flux intensity falls below some flux limit. In fact, we can hardly imagine real data free of measurement errors; the question is how severe the measurement errors are and what their influence on the data analysis is [1,2,3].
A variety of functional models have been proposed for handling measurement errors in statistical inference. Technicians, geologists, and other specialists are aware of this problem, and try to reduce the effect of measurement errors with various ad hoc procedures. However, this effect cannot be completely eliminated or substantially reduced unless we have some additional knowledge on the behavior of measurement errors.
There exists a rich literature on statistical inference in errors-in-variables (EV) models, as is evidenced by the monographs of Fuller [4], Carroll et al. [5], and Cheng and van Ness [6], and the references therein. The monographs [4] and [6] deal mostly with the classical Gaussian setup, while [5] discusses numerous inference procedures under a semi-parametric setup. Nonparametric methods in EV models are considered in [7,8] and in references therein, and in [9], among others. The regression quantile theory in the area of EV models was started by He and Liang [10]. Arias [11] used an instrumental variable estimator for quantile regression, considering biases arising from unmeasured ability and measurement errors. The papers dealing with practical aspects of measurement error models include [12,13,14,15,16], among others. Recent developments in treating the effect of measurement errors on econometric models were presented in [17] and [18]. The advantage of rank and signed rank procedures in measurement error models was discovered recently in [19,20,21,22,23,24]. The problem of interest in the present paper is to study how measurement errors can affect the conclusion of the likelihood ratio test.
The distribution function of measurement errors is considered unknown, up to zero expectation and unit variance. When we use the likelihood ratio test while ignoring the possible measurement errors, we can suffer a loss in the errors of both the first and the second kind. However, we show that under a small variance of measurement errors, the original likelihood ratio test is still most powerful, only on a slightly changed significance level.
On the other hand, we may consider the situation that or are classes of distributions of random variables Hence, both hypothesis and alternative are composite as families and if they are bounded by alternating Choquet capacities of order 2, then we can look for a minimax test based on the ratio of the capacities, and/or on the ratio of the pair of the least favorable distributions of and , respectively (cf. Huber and Strassen [25]).
2. Likelihood Ratio Test under Measurement Errors
Our primary goal is to test the null hypothesis that independent observations come from a population with a density f against the alternative that the true density is g, where f and g are fixed densities of our interest. For the identifiability, we shall assume that f and g are continuous and symmetric around 0. Although the alternative is the main concern of the experimenter, measurement errors, or nature itself, may cause the true alternative to be composite. Specifically, can be affected by additive measurement errors, which occurs in numerous fields, as illustrated in Section 1.
Hence the alternative is under which the observations are identically distributed with continuous density Here, both under the hypothesis and under the alternative, are independent random variables, unobservable with unknown distribution, independent of The parameter is also unknown, only we assume that and for simplicity. The mismeasured, hence unobservable, are assumed to have the density g under the alternative. Quite analogously, the mismeasured observations lead to a composite hypothesis under which the density of observations is while the are assumed to have density f.
If we knew and we would use the Neyman-Pearson critical region
with u determined so that
with a significance level Evidently
Indeed, notice that
where the expectations are considered with respect to the conditional distribution; a similar equality holds for
Combining the integration transmission in the conditional distribution, we obtain
hence the size of the critical region W when used for testing against differs from Then we ask how the critical region W in (1) behaves when it is used as a test of This problem we shall try to attack with an expansion of in close to zero.
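The change in the size of the critical region W can be made concrete in a special case where everything is available in closed form. The sketch below is only an illustration under assumptions that are not taken from the paper: a single observation, f the N(0,1) density, g the N(0,4) density, a normal error V with zero mean and unit variance, and the additive model observed = clean + sqrt(delta) * V. Under these assumptions the Neyman-Pearson region is of the form |x| > u, and the observed value is again normal, so the actual size and power follow directly from the normal distribution function.

```python
from math import sqrt
from statistics import NormalDist

ALPHA = 0.05
SD_NULL = 1.0   # assumed null density f: N(0, 1)
SD_ALT = 2.0    # assumed alternative density g: N(0, 4)

# For one observation, g(x)/f(x) increases in |x|, so the Neyman-Pearson
# region is {|x| > U} with U the (1 - alpha/2) quantile of f.
U = NormalDist(0.0, SD_NULL).inv_cdf(1.0 - ALPHA / 2.0)


def rejection_probability(sd_clean, delta):
    """P(|clean + sqrt(delta) * V| > U), clean ~ N(0, sd_clean^2), V ~ N(0, 1).

    The additive sqrt(delta) scaling is an assumption of this sketch.
    """
    sd_observed = sqrt(sd_clean ** 2 + delta)
    return 2.0 * (1.0 - NormalDist(0.0, sd_observed).cdf(U))


if __name__ == "__main__":
    for delta in (0.0, 0.05, 0.1, 0.2):
        size = rejection_probability(SD_NULL, delta)
        power = rejection_probability(SD_ALT, delta)
        print(f"delta={delta:.2f}  size={size:.4f}  power={power:.4f}")
```

With delta = 0 the size equals the nominal level; for delta > 0 the same region is used, but its rejection probability under the null is larger because the observed variance is inflated by delta.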
Approximations of Densities
Put the densities of X under the hypotheses and alternative, respectively. For the identifiability, we shall assume that and are continuous and symmetric around 0. Denote the density of This means that X is affected by an additive measurement error where V is independent of X and Notice that if densities of X and V are strongly unimodal, then that of Z is also strongly unimodal (see [26]). Under some additional conditions on we shall derive approximations of and for small More precisely, we assume that both and have differentiable and integrable derivatives up to order 5. Then we have the following expansion of and a parallel result for :
Theorem 1.
Assume that and are symmetric around 0, strongly unimodal with differentiable and integrable derivatives, up to the order 5. Then, as
Proof.
Let be the characteristic function of Then
where denotes the characteristic function of V. Taking the inverse Fourier transform on both sides, we obtain (3), taking the above assumptions on V into account. □
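A small numerical check can illustrate what an expansion of this type says. The sketch below does not reproduce the statement of Theorem 1; it assumes the additive model with scaling sqrt(delta), takes both the clean density and the error V to be standard normal (so that the contaminated density is exactly N(0, 1 + delta)), and compares that exact density with the leading-order approximation f + (delta/2) f''. If the approximation is correct to first order in delta, halving delta should reduce the maximal error by a factor of about four.

```python
import math


def normal_pdf(x, var=1.0):
    """Density of N(0, var) at x."""
    return math.exp(-x * x / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def first_order_approx(x, delta):
    """f(x) + (delta / 2) * f''(x) for f the N(0, 1) density.

    For the standard normal density, f''(x) = (x^2 - 1) * f(x).
    """
    f = normal_pdf(x)
    return f + 0.5 * delta * (x * x - 1.0) * f


def max_abs_error(delta, grid):
    """Largest gap between the exact N(0, 1 + delta) density and the approximation."""
    return max(
        abs(normal_pdf(x, 1.0 + delta) - first_order_approx(x, delta))
        for x in grid
    )


GRID = [i * 0.05 for i in range(-100, 101)]  # points from -5 to 5

if __name__ == "__main__":
    for delta in (0.2, 0.1, 0.05):
        print(f"delta={delta:.2f}  max error={max_abs_error(delta, GRID):.6f}")
```

The Gaussian case is chosen only because the exact contaminated density is known; for other choices of f and V the exact density would have to be obtained by numerical convolution.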
Consider the problem of testing the hypothesis that the observations are distributed according to density against the alternative that they are distributed according to density In parallel, we consider the hypothesis that observations are distributed according to density against the alternative that the true density is Let be the likelihood ratio test with critical region and the significance level and be the test with critical region based on observations We know neither nor hence the test is just an application of the critical region W for contaminated data Thus, due to our lack of information, we use the test even for testing against and the performance of this test is of interest. This is described in the following theorem:
Theorem 2. (Assume the conditions of Theorem 1).
Then, as the test is the most powerful even for testing against with a modified significance level satisfying
Proof.
If is symmetric, then the derivative is symmetric for k even and skew-symmetric for k odd, Moreover, because and are integrable, then and Hence, using the expansion (3), we obtain
□
3. Robust Testing
If the observations are mismeasured or contaminated, we observe with unknown and unobservable V instead of Z. Hence, instead of simple and we are led to composite hypothesis and alternative and . Following [25], we can try to find suitable 2-alternating capacities, dominating and and to construct a pertaining minimax test. As before, we assume that Z and V are independent, , and Moreover, we assume that and are symmetric, strongly unimodal and differentiable up to order 5, with derivatives integrable and increasing distribution functions and respectively. The measurement errors V are assumed to satisfy
with a fixed Hence the distribution of V is restricted to have the tails lighter than t-distribution with 4 degrees of freedom. We shall construct a pair of 2-alternating capacities around specific subfamilies of and
Let us determine the capacity around ; that for is analogous. By Theorem 1 we have
We shall concentrate on the following family of densities (similarly for ):
with fixed suitable
Indeed, under our assumptions, each is a positive and symmetric density satisfying
for some
Let be the probability distribution induced by density with being the Borel -algebra. Then the set function
is a pseudo-capacity in the sense of Buja [27], i.e., satisfying
- (a)
- (b)
- (c)
- (d)
- (e)
Analogously, consider a density symmetric around 0 and satisfying the assumptions of Theorem 1 as a simple hypothesis. Construct the family of densities and the corresponding family of distributions similarly as above. Then the set function
is a pseudo-capacity in the sense of Buja [27].
Buja [27] showed that on any Polish space exists a (possibly different) topology which generates the same Borel algebra and on which every pseudo-capacity is a 2-alternating capacity in the sense of [25].
Let us now consider the problem of testing the hypothesis against the alternative based on an independent random sample Assume that and satisfy (5). Then, following [27] and [25], we have the main theorem providing the minimax test of against with significance level
Theorem 3.
The test
where is a version of and and γ are chosen so that is a minimax test of against of level
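The mechanics of a randomized test of this kind can be sketched in a few lines. The sketch assumes the test has the standard randomized Neyman-Pearson form: reject when the product of the density ratio over the sample exceeds the critical value, and reject with probability γ when it equals it. The names `ratio`, `c` and `gamma` are placeholders; in particular, `ratio` stands in for the ratio of the least favorable pair, which the user would have to supply.

```python
import math


def rejection_probability(sample, ratio, c, gamma):
    """Probability with which a randomized likelihood-ratio test rejects.

    sample : iterable of observations
    ratio  : callable giving the density ratio at a single observation
             (placeholder for the ratio of the least favorable pair)
    c      : critical value
    gamma  : randomization probability in [0, 1], used only on equality
    """
    statistic = math.prod(ratio(x) for x in sample)
    if statistic > c:
        return 1.0
    if statistic == c:
        return gamma
    return 0.0


def decide(sample, ratio, c, gamma, rng):
    """Draw the actual decision (True = reject) using a random.Random-like rng."""
    return rng.random() < rejection_probability(sample, ratio, c, gamma)
```

For continuous data the equality case has probability zero, so γ matters mainly when the ratio is constant on sets of positive probability, as can happen when the ratio is truncated.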
4. Numerical Illustration
We assume to observe independent observations for , where as described in Section 3, where are independent identically distributed (with a distribution function F) but unobserved. Let us further denote by the distribution function of and by the distribution function of . The primary task here is to test against
with a fixed and . We perform all the computations using the R software [28].
To describe our approach to computing the test, we will need the notation for the set of pseudo-distribution functions corresponding to the set of pseudo-densities denoted as
where denotes the distribution function of distribution. Under the alternative, the set analogous to is defined as
Our task is to approximate
and
Here, the functions and are evaluated over a grid with step 0.05. Then, the maximization in (8) and (9) is performed for values of z over the grid and over four boundary values of , which are equal to , , , and . Additional computations with 10 randomly selected pairs of over and revealed that the optimum is attained in one of the boundary values. Further, the Radon-Nikodym derivatives of V and W are estimated by a finite difference approximation in order to compute the test statistic.
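The two computational steps just described, a pointwise maximization over a finite set of candidates on a grid with step 0.05 and a finite difference approximation of a derivative of one function with respect to another, can be sketched as follows. This is a minimal illustration, not the authors' R code: the candidate functions used in the demonstration are plain normal distribution functions with hypothetical scales, standing in for the families over which (8) and (9) are maximized.

```python
from statistics import NormalDist

STEP = 0.05  # grid step used in the text


def make_grid(lo, hi, step=STEP):
    """Equally spaced grid lo, lo + step, ..., hi."""
    n = int(round((hi - lo) / step))
    return [lo + i * step for i in range(n + 1)]


def upper_envelope(candidates, grid):
    """Pointwise maximum over a finite list of candidate functions."""
    return [max(fn(z) for fn in candidates) for z in grid]


def finite_difference_ratio(w_values, v_values):
    """Approximate dW/dV on each grid cell by the ratio of increments."""
    ratios = []
    for i in range(len(v_values) - 1):
        dv = v_values[i + 1] - v_values[i]
        dw = w_values[i + 1] - w_values[i]
        # Cells where V does not increase carry no information about dW/dV.
        ratios.append(dw / dv if dv > 0.0 else float("nan"))
    return ratios


if __name__ == "__main__":
    grid = make_grid(-3.0, 3.0)
    # Hypothetical candidates: normal distribution functions with a few scales.
    v_candidates = [NormalDist(0.0, s).cdf for s in (1.0, 1.05)]
    w_candidates = [NormalDist(0.0, s).cdf for s in (2.0, 2.1)]
    v = upper_envelope(v_candidates, grid)
    w = upper_envelope(w_candidates, grid)
    ratio = finite_difference_ratio(w, v)
    print(ratio[len(ratio) // 2])
```

Restricting the maximization to a few boundary candidates keeps the cost linear in the grid size; whether the optimum is really attained at the boundary has to be checked separately, as was done in the text with randomly selected interior pairs.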
The test rejects if the test statistic exceeds a critical value, which (as well as the p-value) can be approximated by a Monte Carlo simulation, i.e., by repeatedly generating random variables under ; we generate them 10,000 times here.
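A Monte Carlo approximation of a critical value and a p-value can be sketched generically. The helper names below are hypothetical, and the statistic simulated in the demonstration (a sum of squares of ten standard normal draws) is only a stand-in for the actual test statistic; the critical value is taken as the empirical (1 − α) quantile of the simulated null statistics, and the p-value as the proportion of simulated statistics at least as large as the observed one.

```python
import math
import random


def simulate_null_statistics(draw_statistic, reps, rng):
    """Simulate the test statistic reps times under the null hypothesis."""
    return [draw_statistic(rng) for _ in range(reps)]


def critical_value(null_stats, alpha):
    """Empirical (1 - alpha) quantile of the simulated null statistics."""
    ordered = sorted(null_stats)
    k = min(len(ordered), math.ceil((1.0 - alpha) * len(ordered)))
    return ordered[max(k, 1) - 1]


def p_value(observed, null_stats):
    """Proportion of simulated null statistics at least as large as observed."""
    return sum(s >= observed for s in null_stats) / len(null_stats)


if __name__ == "__main__":
    rng = random.Random(2018)

    def draw(r):
        # Hypothetical statistic: sum of squares of 10 standard normal draws.
        return sum(r.gauss(0.0, 1.0) ** 2 for _ in range(10))

    null_stats = simulate_null_statistics(draw, 10_000, rng)
    print(critical_value(null_stats, 0.05), p_value(20.0, null_stats))
```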
We perform the following particular numerical study. We compute the critical value of the -test for (or ), , , , , and . Further, we are interested in evaluating the probability of rejecting this test for data generated from
with different values of and . Its values are shown in Table 1 (for ) and Table 2 (for ), which are approximated using (again) 10,000 randomly generated variables from (10). The boldface numbers are equal to the power of the test (under the simple ). The proposed test seems meaningful, while its power is increased for compared to ; in addition, the power increases with an increasing if is retained; and the power also increases with an increasing if is retained.
Table 1.
Probability of rejecting the test in the simulation with .
Table 2.
Probability of rejecting the test in the simulation with .
5. Conclusions
The likelihood ratio test of against is considered in the situation that observations are mismeasured due to the presence of measurement errors. Thus instead of for we observe with unobservable parameter and unobservable random variable . When we ignore the presence of measurement errors and perform the original test, the probability of type I error becomes different from the nominal value, but the test is still the most powerful among all tests on the modified level.
Under some assumptions on and and for we further construct a minimax likelihood ratio test of some families of distributions of the based on the capacities of the Huber-Strassen type. The test treats the composite null and alternative hypotheses, which cover all possible measurement errors satisfying the assumptions. The advantage of the novel test is that it keeps the probability of type I error below the desired value () across all possible measurement errors. The test is performed in a straightforward way, while the user must specify particular (not excessively large) values of and K. We do not consider this a limiting requirement, because parameters corresponding to the severity of measurement errors are commonly chosen in a similar way in numerous measurement error models [5,23] or robust optimization procedures [29]. The critical value of the test can be approximated by a simulation. The numerical experiment in Section 4 illustrates the principles and performance of the novel test.
Author Contributions
Methodology, M.B. and J.J.; Software, J.K.; Writing—Original Draft Preparation, M.B. and J.J.; Writing—Review & Editing, M.B., J.J. and J.K.; Funding Acquisition, M.B.
Funding
The research of Jana Jurečková was supported by the Grant 18-01137S of the Czech Science Foundation. The research of Jan Kalina was supported by the Grant 17-01251S of the Czech Science Foundation.
Acknowledgments
The authors would like to thank two anonymous referees for constructive advice.
Conflicts of Interest
The authors declare no conflict of interest.
References
- Boyd, A.; Lankford, H.; Loeb, S.; Wyckoff, J. Measuring test measurement error: A general approach. J. Educ. Behav. Stat. 2013, 38, 629–663.
- Brakenhoff, T.B.; Mitroiu, M.; Keogh, R.H.; Moons, K.G.M.; Groenwold, R.H.H.; van Smeden, M. Measurement error is often neglected in medical literature: A systematic review. J. Clin. Epidemiol. 2018, 98, 89–97.
- Edwards, J.K.; Cole, S.R.; Westreich, D. All your data are always missing: Incorporating bias due to measurement error into the potential outcomes framework. Int. J. Epidemiol. 2015, 44, 1452–1459.
- Fuller, W.A. Measurement Error Models; John Wiley & Sons: New York, NY, USA, 1987.
- Carroll, R.J.; Ruppert, D.; Stefanski, L.A.; Crainiceanu, C.M. Measurement Error in Nonlinear Models: A Modern Perspective, 2nd ed.; Chapman & Hall/CRC: Boca Raton, FL, USA, 2006.
- Cheng, C.L.; van Ness, J.W. Statistical Regression with Measurement Error; Arnold: London, UK, 1999.
- Carroll, R.J.; Maca, J.D.; Ruppert, D. Nonparametric regression in the presence of measurement error. Biometrika 1999, 86, 541–554.
- Carroll, R.J.; Delaigle, A.; Hall, P. Non-parametric regression estimation from data contaminated by a mixture of Berkson and classical errors. J. R. Stat. Soc. B 2007, 69, 859–878.
- Fan, J.; Truong, Y.K. Nonparametric regression estimation involving errors-in-variables. Ann. Stat. 1993, 21, 23–37.
- He, X.; Liang, H. Quantile regression estimate for a class of linear and partially linear errors-in-variables models. Stat. Sin. 2000, 10, 129–140.
- Arias, O.; Hallock, K.F.; Sosa-Escudero, W. Individual heterogeneity in the returns to schooling: Instrumental variables quantile regression using twins data. Empir. Econ. 2001, 26, 7–40.
- Hyk, W.; Stojek, Z. Quantifying uncertainty of determination by standard additions and serial dilutions methods taking into account standard uncertainties in both axes. Anal. Chem. 2013, 85, 5933–5939.
- Kelly, B.C. Some aspects of measurement error in linear regression of astronomical data. Astrophys. J. 2007, 665, 1489–1506.
- Marques, T.A. Predicting and correcting bias caused by measurement error in line transect sampling using multiplicative error model. Biometrics 2004, 60, 757–763.
- Rocke, D.M.; Lorenzato, S. A two-component model for measurement error in analytical chemistry. Technometrics 1995, 37, 176–184.
- Akritas, M.G.; Bershady, M.A. Linear regression for astronomical data with measurement errors and intrinsic scatter. Astrophys. J. 1996, 470, 706–728.
- Hausman, J. Mismeasured variables in econometric analysis: Problems from the right and problems from the left. J. Econ. Perspect. 2001, 15, 57–67.
- Hyslop, D.R.; Imbens, Q.W. Bias from classical and other forms of measurement error. J. Bus. Econ. Stat. 2001, 19, 475–481.
- Jurečková, J.; Picek, J.; Saleh, A.K.M.E. Rank tests and regression rank scores tests in measurement error models. Comput. Stat. Data Anal. 2010, 54, 3108–3120.
- Jurečková, J.; Koul, H.L.; Navrátil, R.; Picek, J. Behavior of R-estimators under Measurement Errors. Bernoulli 2016, 22, 1093–1112.
- Navrátil, R.; Saleh, A.K.M.E. Rank tests of symmetry and R-estimation of location parameter under measurement errors. Acta Univ. Palacki. Olomuc. Fac. Rerum Nat. Math. 2011, 50, 95–102.
- Navrátil, R. Rank tests and R-estimates in location model with measurement errors. In Proceedings of Workshop of the Jaroslav Hájek Center and Financial Mathematics in Practice I; Masaryk University: Brno, Czech Republic, 2012; pp. 37–44.
- Saleh, A.K.M.E.; Picek, J.; Kalina, J. R-estimation of the parameters of a multiple regression model with measurement errors. Metrika 2012, 75, 311–328.
- Sen, P.K.; Jurečková, J.; Picek, J. Rank tests for corrupted linear models. J. Indian Stat. Assoc. 2013, 51, 201–230.
- Huber, P.; Strassen, V. Minimax tests and the Neyman-Pearson lemma for capacities. Ann. Stat. 1973, 2, 251–273.
- Ibragimov, I.A. On the composition of unimodal distributions. Theor. Probab. Appl. 1956, 1, 255–260.
- Buja, A. On the Huber-Strassen theorem. Probab. Theory Relat. Fields 1986, 73, 149–152.
- R Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2017; Available online: https://www.R-project.org/ (accessed on 15 September 2018).
- Xanthopoulos, P.; Pardalos, P.M.; Trafalis, T.B. Robust Data Mining; Springer: New York, NY, USA, 2013.
© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).