Article

Asymptotics of Subsampling for Generalized Linear Regression Models under Unbounded Design

1 School of Mathematics, Harbin Institute of Technology, Harbin 150001, China
2 School of Mathematical Sciences, Soochow University, Suzhou 215006, China
3 Department of Industrial Systems Engineering & Management, National University of Singapore, 21 Lower Kent Ridge Road, Singapore 119077, Singapore
* Authors to whom correspondence should be addressed.
† These authors contributed equally to this work.
Entropy 2023, 25(1), 84; https://doi.org/10.3390/e25010084
Submission received: 8 November 2022 / Revised: 27 December 2022 / Accepted: 28 December 2022 / Published: 31 December 2022
(This article belongs to the Special Issue Recent Advances in Statistical Theory and Applications)

Abstract

Optimal subsampling is a statistical methodology for generalized linear models (GLMs) that enables fast inference about parameter estimation in massive-data regression. The existing literature considers only bounded covariates. In this paper, the asymptotic normality of the subsampling M-estimator based on the Fisher information matrix is obtained. We then study the asymptotic properties of subsampling estimators for unbounded GLMs with nonnatural links, including both conditional and unconditional asymptotic properties.

1. Introduction

In recent years, the amount of information that people need to process has been increasing dramatically, and it is a great challenge to process such massive data directly for statistical analysis. The divide-and-conquer strategy can mitigate the challenge of directly processing big data [1], but it still consumes considerable computing resources. As a computationally cheaper alternative, subsampling is valuable when computing resources are limited.
To reduce the burden on the machine, subsampling strategies for big data have received increasing attention in recent years. Ref. [2] proposes simple necessary and sufficient conditions for a convolved subsampling estimator to produce a normal limit that matches the target of bootstrap estimation; Ref. [3] provides an optimal distributed subsampling method for maximum quasi-likelihood estimators with massive data; Ref. [4] studies adaptive optimal subsampling algorithms; and Ref. [5] describes a subdata selection method based on leverage scores which conducts linear model selection on a small subdata set.
GLMs are a class of statistical models with a wide range of applications, such as [6,7,8]. Many subsampling studies are based on GLMs, such as [3,9,10]. However, the covariates of the subsampled GLMs in the existing literature are bounded. In some big data problems the covariates are not strictly bounded; for example, the number of clicks on a web page can grow without limit. This requires extending existing theory to the unbounded design. To fill this gap, this paper studies the asymptotic properties of subsampled GLMs with unbounded covariates based on empirical process and martingale techniques.
Our contributions are threefold: (1) we describe the asymptotic property of the subsampled M-estimator using the Fisher information matrix; (2) we give the conditional consistency and asymptotic normality of the subsampling estimator for unbounded GLMs; (3) we provide the unconditional consistency and asymptotic normality of the subsampling estimator for unbounded GLMs.
The rest of the paper is organized as follows. Section 2 introduces the basic concepts of GLMs and the subsampling M-estimation problem. Section 3 presents the asymptotic properties of subsampling estimators for unbounded GLMs. Section 4 gives the conclusion and discussion, as well as future research directions. All technical proofs are collected in Appendix A.

2. Preliminaries

This section introduces the subsampling M-estimation problem and GLMs.

2.1. Subsampling M-Estimation

Let $\{\,l(\beta; Z) \in \mathbb{R} \mid Z \in \mathcal{Z}\,\}$ be a set of loss functions indexed by $\beta \in \Theta \subseteq \mathbb{R}^p$, where $\Theta$ is a finite-dimensional convex set, and let $U = \{1, 2, \ldots, N\}$ be the index set of the full large dataset with $\sigma$-algebra $\mathcal{F}_N = \sigma(Z_1, \ldots, Z_N)$, where for each $i \in U$ the random data point $Z_i \in \mathcal{Z}$ (some probability space) is observed. The empirical risk $L_N : \Theta \to \mathbb{R}$ is given by $L_N(\beta) = \frac{1}{N}\sum_{i \in U} l(\beta; Z_i)$.
The goal is to obtain the solution $\hat{\beta}_N$ that minimizes the risk, namely
$$\hat{\beta}_N = \arg\min_{\beta \in \Theta} L_N(\beta). \tag{1}$$
To solve Equation (1), we need $\hat{\beta}_N$ to satisfy $\nabla L_N(\beta) = \frac{1}{N}\sum_{i \in U}\nabla l(\beta; Z_i) = 0$, and we let $\Sigma_N := \nabla^2 L_N(\hat{\beta}_N)$. This is an M-estimation problem; see [11]. To solve the large-scale estimation problem in Equation (1) quickly, we propose subsampling M-estimation. Consider an index set $S = \{i_1, i_2, \ldots, i_n\}$ drawn with replacement from $U$ according to sampling probabilities $\{\pi_i\}_{i=1}^N$ with $\sum_{i=1}^N \pi_i = 1$. The subsampling M-estimation problem is to obtain the solution $\hat{\beta}_n$ satisfying
$$\nabla L_n^*(\beta) = 0 \quad \text{with} \quad \nabla L_n^*(\beta) = \frac{1}{Nn}\sum_{i \in S}\frac{1}{\pi_i^*}\,\nabla l(\beta; Z_i^*),$$
where $Z_i^*$ is the $i$-th subsampled data point (drawn with replacement) and $\pi_i^*$ is the subsampling probability of $Z_i^*$. For example, if $Z_1^* = Z_1$, then $\pi_1^* = \pi_1$; if $Z_2^* = Z_1$, then $\pi_2^* = \pi_1$. Denote by $a_i$ the number of times the $i$-th data point is subsampled, so that $\sum_{i \in U} a_i = n$. The subsampled loss $L_n^*(\beta)$ is constructed by the inverse probability weighting technique so that $\mathbb{E}[L_n^*(\beta) \mid \mathcal{F}_N] = L_N(\beta)$; see [12]. Details about the properties of conditional expectation can be found in [13].
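To make the construction above concrete, the following minimal Python sketch draws an index set $S$ with replacement under given probabilities $\pi_i$ and forms the inverse-probability-weighted gradient $\nabla L_n^*(\beta)$. It is purely illustrative and not part of the paper's analysis; we assume a squared-error loss and uniform sampling probabilities for concreteness, and all variable names are our own.

```python
import numpy as np

rng = np.random.default_rng(2023)

# Toy full data with squared-error loss l(beta; Z_i) = (y_i - x_i^T beta)^2 / 2
# (an illustrative choice; the loss in the text is generic).
N, n, p = 10_000, 200, 3
X = rng.normal(size=(N, p))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=N)

def grad_l(beta, Xs, ys):
    # Per-observation gradient of the squared-error loss: -(y - x^T beta) x.
    resid = ys - Xs @ beta
    return -(resid[:, None] * Xs)

pi = np.full(N, 1.0 / N)                        # uniform subsampling probabilities
S = rng.choice(N, size=n, replace=True, p=pi)   # indices drawn with replacement

beta = np.zeros(p)
# Inverse-probability-weighted subsample gradient (1/(Nn)) sum_{i in S} grad / pi_i.
grad_sub = (grad_l(beta, X[S], y[S]) / pi[S, None]).sum(axis=0) / (N * n)
grad_full = grad_l(beta, X, y).mean(axis=0)     # full-data gradient of L_N(beta)
print(grad_sub, grad_full)  # conditionally unbiased: the two agree on average
```

By construction, $\mathbb{E}[\nabla L_n^*(\beta) \mid \mathcal{F}_N] = \nabla L_N(\beta)$, so the two printed vectors agree up to subsampling noise.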

2.2. Generalized Linear Models

Let the random variable $Y$ follow a distribution from the natural exponential family $P_\alpha$ indexed by the parameter $\alpha$,
$$P_\alpha(dy) = dF_Y(y) = c(y)\exp\{y\alpha - b(\alpha)\}\,\nu(dy), \quad c(y) > 0,$$
where $\alpha$ is often referred to as the canonical parameter belonging to its natural space
$$\Lambda = \Big\{\alpha : \int c(y)\exp\{y\alpha\}\,\nu(dy) < \infty\Big\}.$$
Here $\nu(\cdot)$ is the Lebesgue measure for continuous distributions (normal, Gamma) or the counting measure for discrete distributions (binomial, Poisson, negative binomial), and $c(y)$ is free of $\alpha$.
Let $\{(Y_i, X_i)\}_{i=1}^N$ be $N$ independent sample data pairs. Here $X_i \in \mathbb{R}^p$ is the covariate vector, and we assume that the response $Y_i$ follows a distribution from the natural exponential family with parameter $\alpha_i \in \Lambda$. The covariates $X_i := (x_{i1}, \ldots, x_{ip})^T$ ($i = 1, 2, \ldots, N$) are assumed to be deterministic.
The conditional expectation of $Y_i$ given $X_i$ is modeled as a function of $\beta^T X_i$ through a link function $\alpha_i = \psi(\beta^T X_i)$. The mean $\mu_i := \mathbb{E}(Y_i)$ is the quantity most commonly considered in regression.
If $\alpha_i = \beta^T X_i$, then $\psi(\beta^T X_i) = \beta^T X_i$ is called the canonical (or natural) link function, and the corresponding model is the canonical (or natural) GLM; see page 32 of [14]. The assumption $\alpha_i = \beta^T X_i$ is sometimes too strong and not very suitable in practice, while nonnatural-link GLMs allow more flexible choices of the link function. We therefore assume that $\alpha_i$ and $\beta^T X_i$ are related by a nonnatural link function $\alpha_i = \psi(\beta^T X_i)$.
Let $f_\beta(Y_i \mid X_i)$ be the density function of the i.i.d. data $\{(Y_i, X_i)\}_{i=1}^N$ from the exponential family with link function $\psi(\cdot)$. Then the nonnatural GLM [15] is defined by
$$Y_i \mid X_i \sim f_\beta(Y_i \mid X_i) = c(Y_i)\exp\big\{Y_i\,\psi(\beta^T X_i) - b\big(\psi(\beta^T X_i)\big)\big\}, \quad i = 1, 2, \ldots, N.$$
A classic result for the exponential family (3) is
$$\mathbb{E}(Y_i \mid X_i) := \mu_i = \dot{b}(\alpha_i) = \dot{b}\big(\psi(\beta^T X_i)\big) \quad \text{and} \quad \operatorname{Var}(Y_i \mid X_i) := \operatorname{Var}(Y_i) = \ddot{b}(\alpha_i),$$
where $i = 1, 2, \ldots, N$; see p. 280 of [16].
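To make the nonnatural-link setup concrete, the following sketch (an illustration of ours, not taken from the paper) evaluates the mean and variance in Equation (4) for a Bernoulli response whose success probability follows a probit curve. In exponential-family form, $b(\alpha) = \log(1 + e^{\alpha})$ and the nonnatural link is $\psi(t) = \operatorname{logit}(\Phi(t))$; the function names and the choice of link are our own assumptions.

```python
import numpy as np
from scipy.stats import norm

# Bernoulli exponential family: b(alpha) = log(1 + exp(alpha)),
# so b_dot is the sigmoid (mean) and b_ddot the Bernoulli variance.
def b_dot(alpha):
    return 1.0 / (1.0 + np.exp(-alpha))

def b_ddot(alpha):
    m = b_dot(alpha)
    return m * (1.0 - m)

# Nonnatural (probit-type) link: alpha_i = psi(beta^T x_i) = logit(Phi(beta^T x_i)).
def psi(t):
    prob = norm.cdf(t)
    return np.log(prob / (1.0 - prob))

rng = np.random.default_rng(0)
beta = np.array([0.5, -1.0])
x = rng.normal(size=2)          # one covariate vector X_i
alpha = psi(x @ beta)           # canonical parameter via the nonnatural link
mu = b_dot(alpha)               # E(Y_i | X_i); here it equals Phi(beta^T x)
var = b_ddot(alpha)             # Var(Y_i | X_i)
print(mu, var)
```

With this choice the mean reduces to $\Phi(\beta^T X_i)$, i.e., an ordinary probit regression written in the nonnatural-link notation above.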

3. Main Results

3.1. Subsampling M-Estimation Problem

In this part, we first look at the term $\Sigma_N^{-1}\nabla L_n^*(\hat{\beta}_N)$. Define an independent random vector sequence $\{\zeta_j\}_{j=1}^N$ and its subsampled version $\{\zeta_j^*\}_{j=1}^n$, such that each vector $\zeta$ takes values in $\big\{\tfrac{1}{N\pi_i}\Sigma_N^{-1}\nabla l(\hat{\beta}_N; Z_i)\big\}_{i=1}^N$ (the $i$-th value being taken with probability $\pi_i$), and let
$$V_M(\hat{\beta}_N; n) = \frac{1}{N^2 n}\,\Sigma_N^{-1}\Big[\sum_{i \in U}\frac{1}{\pi_i}\,\nabla l(\hat{\beta}_N; Z_i)\,\nabla l(\hat{\beta}_N; Z_i)^T\Big]\,\Sigma_N^{-1}.$$
From the definition of $\nabla L_N(\beta)$, we have $\mathbb{E}(\zeta \mid \mathcal{F}_N) = \Sigma_N^{-1}\nabla L_N(\hat{\beta}_N) = 0$ and $\operatorname{Var}(\zeta \mid \mathcal{F}_N) = n\,V_M(\hat{\beta}_N; n)$. We then have the following asymptotic property of the subsampled M-estimator.
Theorem 1.
Suppose that the risk function $L_N(\beta)$ is twice differentiable and $\lambda$-strongly convex over $\Theta$, that is, for $\beta \in \Theta$, $\nabla^2 L_N(\beta) \geq \lambda I$, where $\geq$ denotes the positive semidefinite ordering; and suppose the sampling-based moment condition
$$\frac{1}{N^4}\sum_{i=1}^N \frac{1}{\pi_i^3}\,\big\|\nabla l(\hat{\beta}_N; Z_i)\big\|^4 = O_P(1)$$
holds. Then, as $n \to \infty$, conditional on $\mathcal{F}_N$,
$$V_M(\hat{\beta}_N; n)^{-1/2}\,(\hat{\beta}_n - \hat{\beta}_N) \xrightarrow{d} N(0, I_p),$$
where $\xrightarrow{d}$ denotes convergence in distribution.
Theorem 1 reveals that the subsampling M-estimation scheme is theoretically feasible under mild conditions. In addition, the existence of the estimator is guaranteed via the Fisher information matrix (see Lemma A1).
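The following simulation sketch (ours, under a squared-error loss and uniform $\pi_i$) illustrates the studentization in Theorem 1: it computes the full-data estimator $\hat{\beta}_N$, the plug-in matrix $V_M(\hat{\beta}_N; n)$, one subsample estimator $\hat{\beta}_n$, and the whitened difference, which over repeated subsamples should look approximately standard normal. A Cholesky factor is used for whitening in place of the symmetric square root; both are valid for this purpose, and all names are our own.

```python
import numpy as np

rng = np.random.default_rng(1)
N, n, p = 20_000, 500, 2
X = np.column_stack([np.ones(N), rng.normal(size=N)])
y = X @ np.array([0.5, 1.0]) + rng.normal(size=N)

# Full-data M-estimator for the squared-error loss: ordinary least squares.
beta_N = np.linalg.solve(X.T @ X, X.T @ y)
Sigma_N = X.T @ X / N                          # Hessian of L_N at beta_N

pi = np.full(N, 1.0 / N)                       # uniform sampling probabilities
grad = -(y - X @ beta_N)[:, None] * X          # per-point gradients at beta_N

# Plug-in V_M(beta_N; n) in the sandwich form displayed above Theorem 1.
Sinv = np.linalg.inv(Sigma_N)
V_M = Sinv @ ((grad / pi[:, None]).T @ grad) @ Sinv / (N**2 * n)

# One subsample estimator: inverse-probability-weighted least squares.
S = rng.choice(N, size=n, replace=True, p=pi)
w = 1.0 / pi[S]
XS, yS = X[S], y[S]
beta_n = np.linalg.solve(XS.T @ (w[:, None] * XS), XS.T @ (w * yS))

z = np.linalg.solve(np.linalg.cholesky(V_M), beta_n - beta_N)  # whitened difference
print(z)   # over many repeated subsamples, z behaves approximately as N(0, I_p)
```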

3.2. Conditional Asymptotic Properties of Subsampled GLMs with Unbounded Covariates

The exponential family is very versatile, containing many common light-tailed distributions such as the binomial, Poisson, negative binomial, normal and Gamma distributions. Together with their attendant convexity properties, which lead to a finite-variance property for the log-density, these distributions underpin a large number of popular and effective statistical models. It is precisely because these distributions are so common that we study the subsampling problem for GLMs.
With the loss function introduced in Section 2.1, we set $l(\beta; Z_i) := -\log f_\beta(Y_i \mid X_i)$, where $f_\beta(Y_i \mid X_i)$ is defined by Equation (2); minimizing this loss function is then equivalent to maximizing the likelihood function. For simplicity, we assume that $c(y) = 1$, so that
$$\nabla l(\beta; Z_i) := -\frac{\partial \log f_\beta(Y_i \mid X_i)}{\partial \beta} = -\big[Y_i - \dot{b}\big(\psi(\beta^T X_i)\big)\big]\,\dot{\psi}(\beta^T X_i)\,X_i$$
with the nonnatural link function $\alpha_i = \psi(\beta^T X_i)$. We also use this idea in Section 3.3.
More generally, we consider a wider class, called quasi-GLMs, rather than GLMs, which only assumes that Equation (4) holds for a certain function $\mu(\cdot)$. Strong consistency and asymptotic normality of the quasi maximum likelihood estimate in GLMs with bounded covariates are proved in [17]. For unbounded covariates, adopting the subsampled estimation of GLMs in [9], we calculate the inverse probability weighted estimator of $\beta$ by solving the estimating equation based on the subsampled index set $S$,
$$\frac{1}{Nn}\sum_{i \in S}\frac{1}{\pi_i}\big[Y_i - \mu\big(\psi(\beta^T X_i)\big)\big]\,\dot{\psi}(\beta^T X_i)\,X_i = 0,$$
where $\{(Y_i, X_i)\}_{i \in S}$ are the subsampled data. Equivalently, we have
$$s_n(\beta) = \sum_{i \in S}\frac{1}{\pi_i}\big[Y_i - \mu\big(\psi(\beta^T X_i)\big)\big]\,\dot{\psi}(\beta^T X_i)\,X_i = 0.$$
The model in Equation (6) is called a quasi-GLM since only Equation (4) is specified instead of the full distribution function.
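As an illustration of how this estimating equation can be solved in practice, the sketch below runs a Newton iteration on the inverse-probability-weighted score for a Poisson model with the canonical specification $\psi(t) = t$ and $\mu(\alpha) = e^{\alpha}$, using uniform $\pi_i$. This is a minimal example of ours rather than the paper's implementation, and all names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)
N, n, p = 50_000, 800, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p - 1))])
beta_true = np.array([0.2, 0.5, -0.3])
y = rng.poisson(np.exp(X @ beta_true))

pi = np.full(N, 1.0 / N)                        # uniform probabilities for illustration
S = rng.choice(N, size=n, replace=True, p=pi)
XS, yS, wS = X[S], y[S], 1.0 / pi[S]

beta = np.zeros(p)
for _ in range(25):
    mu = np.exp(XS @ beta)                      # mu(psi(beta^T x)) with psi(t) = t
    score = XS.T @ (wS * (yS - mu))             # weighted score s_n(beta)
    info = XS.T @ (wS[:, None] * mu[:, None] * XS)  # minus the Jacobian of s_n
    step = np.linalg.solve(info, score)
    beta = beta + step
    if np.linalg.norm(step) < 1e-10:
        break
print(beta)                                      # subsample estimator beta_hat_n
```

Since the weights $1/\pi_i$ are constant here, they cancel in the Newton step; with nonuniform probabilities (such as the A-optimality-motivated choice sketched after Theorem 3) they do not, and the weighting is what keeps the estimating equation conditionally unbiased.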
Let $\hat{\beta}_n$ be the estimator of the true parameter $\beta_0$ in the subsampled quasi-GLM and $\hat{\beta}_N$ be the estimator of $\beta_0$ in the quasi-GLM with full data. For the unbounded quasi-GLM with full data, $\hat{\beta}_N$ is asymptotically unbiased for $\beta_0$; see [18]. Next, we focus on the asymptotic properties of $\hat{\beta}_n$, as shown in the following theorems.
Theorem 2.
Let $\{(Y_i, X_i)\}_{i \in S}$ be subsampled from the i.i.d. full data $\{(Y_i, X_i)\}_{i \in U}$. Consider Equations (4) and (6), where $\psi(\cdot)$ is three times continuously differentiable with every derivative bounded, and $b(\cdot)$ is twice continuously differentiable with every derivative bounded. Assume that:
(A.1) The range of the unknown parameter $\beta$ is an open subset of $\mathbb{R}^p$.
(A.2) For any $i \in S$, $\mathbb{E}\big[\sup_{\beta \in \Theta} \frac{1}{\pi_i}\,|Y_i - \mu(\psi(\beta^T X_i))| \,\big|\, \mathcal{F}_N\big] = O(1)$.
(A.3) For any $\beta \in \Theta$ and $i \in S$, $0 < \inf_i \varphi(\beta^T X_i) \leq \sup_i \varphi(\beta^T X_i) < \infty$, where $\varphi(t) = [\dot{\psi}(t)]^2\,\ddot{b}(\psi(t))$.
(A.4) For any $\beta_1 \in \Theta$ and $\beta_2 \in \Theta$, there exists a function $m(\cdot)$ with $|m(X_i)| < \infty$ such that
$$|\varphi(\beta_1^T X_i) - \varphi(\beta_2^T X_i)| \leq |m(X_i)|\,|\beta_1^T X_i - \beta_2^T X_i|.$$
(A.5) As $n \to \infty$, $\max_{i \in S} X_i^T(\mathbf{X}\mathbf{X}^T)^{-1} X_i = O(n^{-1})$ and $\lambda_{\min}[\mathbf{X}\mathbf{X}^T] \to \infty$, where $\mathbf{X} = (X_1, \ldots, X_n)$ and $\lambda_{\min}[\mathbf{A}]$ is the smallest eigenvalue of the matrix $\mathbf{A}$.
(A.6) $\min_{i=1,\ldots,N}(N\pi_i) = O(1)$ and $\max_{i=1,\ldots,N}(N\pi_i) = O(1)$.
Then $\hat{\beta}_n$ is consistent with $\hat{\beta}_N$, i.e.,
$$\hat{\beta}_n - \hat{\beta}_N = o_{P|\mathcal{F}_N}(1),$$
where $o_{P|\mathcal{F}_N}(1)$ denotes a term that is $o(1)$ in probability conditional on $\mathcal{F}_N$.
Theorem 3.
Under the conditions of Theorem 2, as $N \to \infty$ and $n \to \infty$, conditional on $\mathcal{F}_N$ in probability,
$$\sqrt{n}\,(\hat{\beta}_n - \hat{\beta}_N) \to N(0, V_s)$$
in distribution, where
$$V_s = \Sigma_N^{-1} V_N \Sigma_N^{-1}, \qquad \Sigma_N = \sum_{i \in U} a_i\big[Y_i - \dot{b}\big(\psi(\hat{\beta}_N^T X_i)\big)\big]\,\ddot{\psi}(\hat{\beta}_N^T X_i)\,X_i X_i^T - \sum_{i \in U} a_i\,\ddot{b}\big(\psi(\hat{\beta}_N^T X_i)\big)\,\big[\dot{\psi}(\hat{\beta}_N^T X_i)\big]^2\,X_i X_i^T,$$
$$V_N = \sum_{i \in U}\frac{a_i}{\pi_i}\,\big[Y_i - \dot{b}\big(\psi(\hat{\beta}_N^T X_i)\big)\big]^2\,\big[\dot{\psi}(\hat{\beta}_N^T X_i)\big]^2\,X_i X_i^T.$$
In this part, we obtain the asymptotic properties without the moment condition on the covariates $\{X_i\}_{i=1}^N$ that is used in [9], which means the $X_i$'s may be unbounded. Here we only provide the theoretical asymptotic results. Furthermore, the subsampling probabilities can be derived from the A-optimality criterion as in [10].
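Although the paper keeps the subsampling probabilities generic, a common A-optimality-motivated choice in the spirit of [10] weights each observation by the product of its absolute residual and an information-standardized leverage computed from a cheap pilot fit. The sketch below is our own illustration for the Poisson example used above; the exact formula, the pilot-fit details and all names are assumptions, not the paper's prescription.

```python
import numpy as np

rng = np.random.default_rng(11)
N, p = 50_000, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p - 1))])
y = rng.poisson(np.exp(X @ np.array([0.2, 0.5, -0.3])))

# Pilot estimate from a small uniform subsample (Poisson, canonical link).
idx0 = rng.choice(N, size=500, replace=True)
beta_pilot = np.zeros(p)
for _ in range(25):
    mu0 = np.exp(X[idx0] @ beta_pilot)
    beta_pilot += np.linalg.solve(
        X[idx0].T @ (mu0[:, None] * X[idx0]), X[idx0].T @ (y[idx0] - mu0))

# A-optimality-style probabilities: pi_i proportional to |y_i - mu_i| * ||M^{-1} x_i||.
mu = np.exp(X @ beta_pilot)
M = X.T @ (mu[:, None] * X) / N                    # pilot information matrix / N
lev = np.linalg.norm(np.linalg.solve(M, X.T).T, axis=1)
weight = np.abs(y - mu) * lev
pi = weight / weight.sum()
pi = np.maximum(pi, 1e-8); pi /= pi.sum()          # guard against zero probabilities

S = rng.choice(N, size=800, replace=True, p=pi)    # subsample used in Equation (6)
```

These probabilities can then be plugged into the weighted Newton iteration sketched earlier to obtain $\hat{\beta}_n$, and into the plug-in forms of $\Sigma_N$ and $V_N$ from Theorem 3 for variance estimation.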

3.3. Unconditional Asymptotic Properties of Subsampled GLMs with Unbounded Covariates

In real engineering applications, measuring some response variables is very expensive, for example superconductor data or deep space exploration data. The accuracy of estimating the target parameters under measurement constraints on the responses is therefore an important issue. Ref. [19] established the unconditional asymptotic properties of parameter estimation in bounded GLMs with a canonical link, but the case of unbounded GLMs with a nonnatural link has not been discussed yet.
In this section, we continue to use the notation of Section 3.2. Through the theory of empirical processes [11], we obtain the unconditional consistency of $\hat{\beta}_n$ in the following theorem.
Theorem 4.
(Unconditional subsampled consistency). Assume the following conditions:
(B.1) $\lambda_{\min}(\mathbb{E}\,XX^T) > 0$, where $X$ is the unbounded covariate of the GLM.
(B.2) For $u_1, u_2 \in [0, 1]$,
$$\inf_{\beta \in \Theta \setminus \{\beta_0\}} \frac{\mathbb{E}\big\{\ddot{b}(\tilde{\psi}_{u_1})\,\dot{\psi}\big[(1 - u_2)(\beta_0^T X) + u_2(\beta^T X)\big]^2\,(\beta^T X - \beta_0^T X)^2\big\}}{\mathbb{E}\,(\beta^T X - \beta_0^T X)^2} \geq C_1 > 0,$$
where $\tilde{\psi}_{u_1} = (1 - u_1)\,\psi(\beta_0^T X) + u_1\,\psi(\beta^T X)$ and $\ddot{b}(\cdot)$ is the second derivative of $b(\cdot)$.
(B.3) $\mathbb{E}_{\beta_0}\big[\sup_{\beta \in \Theta} |Y - \dot{b}(\psi(\beta^T X))| \cdot \|X\|^2\big] < \infty$, where $\dot{b}(\cdot)$ is the first derivative of $b(\cdot)$.
(B.4) $\psi(\cdot)$ in (3) is twice continuously differentiable and each of its derivatives has a positive minimum.
(B.5) $b(\cdot)$ in (3) is twice continuously differentiable and each of its derivatives has a positive minimum.
Then $\hat{\beta}_n - \beta_0 = o_P(1)$.
Theorem 4 directly gives the unconditional consistency of the subsampling estimator with respect to the true parameter under the unbounded-design assumption.
To prove the asymptotic normality of $\hat{\beta}_n$ with respect to $\beta_0$, we briefly recall the subsampled score function from Section 3.2,
$$s_n(\beta) = \sum_{i \in S}\frac{1}{\pi_i}\big[Y_i - \mu\big(\psi(\beta^T X_i)\big)\big]\,\dot{\psi}(\beta^T X_i)\,X_i := \sum_{i \in S}\frac{1}{\pi_i}\,\phi_\beta(X_i, Y_i).$$
Next we apply a multivariate martingale central limit theorem (Lemma 4 in [19]), which extends Theorem A.1 in [20], to show the asymptotic normality of $\hat{\beta}_n$. Let $\{\mathcal{F}_{N,i}\}_{i=1}^n$ be a filtration adapted to the sampling: $\mathcal{F}_{N,0} = \sigma(X_1^N, Y_1^N)$; $\mathcal{F}_{N,1} = \sigma(X_1^N, Y_1^N) \vee \sigma(1)$; $\ldots$; $\mathcal{F}_{N,i} = \sigma(X_1^N, Y_1^N) \vee \sigma(1) \vee \cdots \vee \sigma(i)$; $\ldots$, where $\sigma(i)$ is the $\sigma$-algebra generated by the $i$-th sampling step. The subsample size $n$ is assumed to increase with $N$. Based on this filtration, we define the martingale
$$\bar{M} := \sum_{i=1}^n \bar{M}_i := \sum_{i=1}^n\Big[\frac{1}{\pi_i}\,\phi_\beta(X_i, Y_i) - \sum_{j=1}^N \phi_\beta(X_j, Y_j)\Big],$$
where $\{\bar{M}_i\}_{i=1}^n$ is a martingale difference sequence adapted to $\{\mathcal{F}_{N,i}\}_{i=1}^n$. In addition, define $Q := n\sum_{j=1}^N \phi_\beta(X_j, Y_j)$, $T := s_n(\beta) = \bar{M} + Q$, $\xi_{Ni} := \operatorname{Var}^{-1/2}(T)\,\bar{M}_i$ and $B_N := \operatorname{Var}^{-1/2}(T)\,\operatorname{Var}(\bar{M})\,\operatorname{Var}^{-1/2}(T)$, where the matrix $A^{1/2}$ is the symmetric square root of $A$, i.e., $A = (A^{1/2})^2$, and $A^{-1/2} = (A^{1/2})^{-1} = (A^{-1})^{1/2}$. $B_N$ is the variance of $\operatorname{Var}^{-1/2}(T)\,\bar{M}$.
The following theorem shows the asymptotic normality of the estimator β ^ n .
Theorem 5.
Assume the following conditions:
(C.1) $\Phi = \mathbb{E}\big(\nabla s_n(\beta)\big) = -\,\mathbb{E}\Big[\sum_{i \in S}\frac{1}{\pi_i}\,\dot{\mu}\big(\psi(\beta^T X_i)\big)\,\big[\dot{\psi}(\beta^T X_i)\big]^2\,X_i X_i^T\Big]$ is finite and nonsingular.
(C.2) $\mathbb{E}\Big[\sum_{i \in U}\frac{a_i}{\pi_i}\,\dot{\mu}\big(\psi(\beta^T X_i)\big)\,\big[\dot{\psi}(\beta^T X_i)\big]^2\,X_{ik}X_{ij}\Big]^2 = o_P(1)$ for $1 \leq k, j \leq p$, where $X_{ik}$ and $X_{ij}$ denote the $k$-th and $j$-th elements of the vector $X_i$.
(C.3) $\psi(x)$ is three times continuously differentiable for every $x$ in its domain.
(C.4) For any $i \in S$, $\|\ddot{\phi}_\beta(X_i, Y_i)\| < \infty$.
(C.5) $\min_{i=1,\ldots,N}(N\pi_i) = \max_{i=1,\ldots,N}(N\pi_i) = O(1)$ and $n/N = o(1)$.
(C.6) $\lim_{N \to \infty}\sum_{i=1}^n \mathbb{E}\big[\|\xi_{Ni}\|^4\big] = 0$.
(C.7) $\lim_{N \to \infty}\mathbb{E}\,\big\|\sum_{i=1}^n \mathbb{E}\big[\xi_{Ni}\xi_{Ni}^T \mid \mathcal{F}_{N,i-1}\big] - B_N\big\|^2 = 0$.
Then
$$\operatorname{Var}(T)^{-1/2}\,\Phi\,(\hat{\beta}_n - \beta_0) \xrightarrow{d} N(0, I_p).$$
Here, we establish the unconditional asymptotic properties of the subsampling estimator for unbounded GLMs. The condition $n/N = o(1)$ ensures that small-scale subsamples already achieve the expected performance, which greatly reduces the computational cost. We also present only the theoretical asymptotic results, which lead to subsampling probabilities based on the A-optimality criterion in [10].

4. Conclusions and Future Work

In this paper, we derive the asymptotic normality of the subsampling M-estimator via the Fisher information matrix. For unbounded GLMs with a nonnatural link function, we obtain the conditional and the unconditional asymptotic properties of the subsampling estimator separately.
For future study, it would be meaningful to apply the sub-Weibull concentration inequalities in [21] to obtain nonasymptotic inference. Importance sampling is not ideal, since it tends to assign high sampling probabilities to the observed samples; hence, more effective subsampling methods for GLMs, such as the Markov subsampling in [22], deserve consideration. Moreover, high-dimensional methods [23,24] for subsampling need further study.

Author Contributions

Conceptualization, B.T.; Methodology, Y.Z.; Validation, G.T.; Writing—original draft, G.T.; Writing—review & editing, B.T., Y.Z. and S.F.; Supervision, B.T.; Funding acquisition, Y.Z. and B.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Key University Science Research Project of Jiangsu Province 21KJB110023 and National Natural Science Foundation of China 91646106.

Data Availability Statement

Not applicable.

Acknowledgments

We would like to thank Huiming Zhang for helpful discussions on large sample theory.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Technical Details

Lemma A1
(Theorem 4.17 in [16]). Let $X_1, \ldots, X_N$ be i.i.d. from a p.d.f. $f_\beta$ w.r.t. a $\sigma$-finite measure $\nu$ on $(\mathbb{R}, \mathcal{B}_{\mathbb{R}})$, where $\beta \in \Theta$ and $\Theta$ is an open set in $\mathbb{R}^p$. Suppose that for every $x$ in the range of $X_1$, $f_\beta(x)$ is twice continuously differentiable in $\beta$ and satisfies:
(D.1) $\frac{\partial}{\partial \beta}\int \psi_\beta(x)\,d\nu = \int \frac{\partial}{\partial \beta}\psi_\beta(x)\,d\nu$ for $\psi_\beta(x) = f_\beta(x)$ and for $\psi_\beta(x) = \frac{\partial f_\beta(x)}{\partial \beta}$.
(D.2) The Fisher information matrix
$$I_1(\beta) = \mathbb{E}\left[\frac{\partial}{\partial \beta}\log f_\beta(X_1)\,\Big(\frac{\partial}{\partial \beta}\log f_\beta(X_1)\Big)^T\right]$$
is positive definite.
(D.3) For any given $\beta \in \Theta$, there exist a positive number $C_\beta$ and a positive function $h_\beta$ such that $\mathbb{E}[h_\beta(X_1)] < \infty$ and
$$\sup_{\gamma : \|\gamma - \beta\| < C_\beta}\left\|\frac{\partial^2 \log f_\gamma(x)}{\partial \gamma\,\partial \gamma^T}\right\| \leq h_\beta(x)$$
for all $x$ in the range of $X_1$, where $\|\cdot\|$ is the Euclidean norm and $\|A\| = \sqrt{\operatorname{tr}(A^T A)}$ for any matrix $A$. Then there exists a sequence of estimators $\hat{\beta}_N$ (based on $X_i$, $i \in U$) such that
$$P\big(s_a(\hat{\beta}_N) = 0\big) \to 1 \quad \text{and} \quad \hat{\beta}_N \xrightarrow{P} \beta_0,$$
where $s_a(\gamma) = \frac{\partial \log \tilde{L}_N(\gamma)}{\partial \gamma}$, $\tilde{L}_N(\gamma)$ is the likelihood function of the full data, and $\beta_0$ is the true parameter. Meanwhile, there exists a sequence of estimators $\hat{\beta}_n$ (based on $X_i$, $i \in S$) such that
$$P\big(s_s(\hat{\beta}_n) = 0\big) \to 1 \quad \text{and} \quad \hat{\beta}_n \xrightarrow{P} \beta_0,$$
where $s_s(\gamma) = \frac{\partial \log \tilde{L}_n(\gamma)}{\partial \gamma}$, $\tilde{L}_n(\gamma)$ is the likelihood function of the subsampled data, and $\beta_0$ is the true parameter.
Let $a_i$ be the number of times the $i$-th data point is subsampled, so that $\sum_{i \in U} a_i = n$.
Lemma A2.
$\mathbb{E}[L_n^*(\beta) \mid \mathcal{F}_N] = L_N(\beta)$.
Proof. 
Since each draw $Z_j^*$ equals $Z_i$ with probability $\pi_i$, one has
$$\mathbb{E}[L_n^*(\beta) \mid \mathcal{F}_N] = \mathbb{E}\Big[\frac{1}{Nn}\sum_{j \in S}\frac{1}{\pi_j^*}\,l(\beta; Z_j^*)\,\Big|\,\mathcal{F}_N\Big] = \frac{1}{Nn}\sum_{j \in S}\mathbb{E}\Big[\frac{1}{\pi_j^*}\,l(\beta; Z_j^*)\,\Big|\,\mathcal{F}_N\Big] = \frac{1}{Nn}\sum_{j \in S}\sum_{i \in U}\pi_i\,\frac{1}{\pi_i}\,l(\beta; Z_i) = \frac{1}{Nn}\,n\sum_{i \in U} l(\beta; Z_i) = \frac{1}{N}\sum_{i \in U} l(\beta; Z_i) = L_N(\beta). \qquad \square$$
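The conditional unbiasedness in Lemma A2 is easy to check numerically; the short Monte Carlo sketch below (ours, with a squared-error loss and deliberately nonuniform probabilities) averages $L_n^*(\beta)$ over repeated subsamples and compares it with $L_N(\beta)$.

```python
import numpy as np

rng = np.random.default_rng(3)
N, n = 5_000, 100
z = rng.normal(size=N)                       # full data Z_1, ..., Z_N
beta = 0.3
loss = 0.5 * (z - beta) ** 2                 # l(beta; Z_i)
L_N = loss.mean()                            # L_N(beta)

pi = np.abs(z) + 0.1
pi = pi / pi.sum()                           # nonuniform sampling probabilities
reps = []
for _ in range(2_000):
    S = rng.choice(N, size=n, replace=True, p=pi)
    reps.append((loss[S] / pi[S]).sum() / (N * n))   # L_n^*(beta)
print(L_N, np.mean(reps))                    # the two values should agree closely
```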
Proposition A1.
Under the conditions of Lemma A1, and assuming
$$\min_{i=1,\ldots,N}(N\pi_i) = \max_{i=1,\ldots,N}(N\pi_i) = O(1),$$
suppose that $\hat{\beta}_N$ (based on $X_i$, $i \in U$) is an estimator of $\beta$ and $\hat{\beta}_n$ (based on $X_i$, $i \in S$) is also an estimator of $\beta$. Then
$$\hat{\beta}_n - \hat{\beta}_N = -\Sigma_N^{-1}\nabla L_n^*(\hat{\beta}_N) + o_{P|\mathcal{F}_N}(1).$$
Proof. 
Taking a Taylor expansion of $\nabla L_n^*(\hat{\beta}_n)$ around $\hat{\beta}_N$, we have
$$0 = \nabla L_n^*(\hat{\beta}_n) = \nabla L_n^*(\hat{\beta}_N) + \nabla^2 L_n^*(\hat{\beta}_N)(\hat{\beta}_n - \hat{\beta}_N) + o(\hat{\beta}_n - \hat{\beta}_N) = \nabla L_n^*(\hat{\beta}_N) + \nabla^2 L_N(\hat{\beta}_N)(\hat{\beta}_n - \hat{\beta}_N) + \big[\nabla^2 L_n^*(\hat{\beta}_N) - \nabla^2 L_N(\hat{\beta}_N)\big](\hat{\beta}_n - \hat{\beta}_N) + o(\hat{\beta}_n - \hat{\beta}_N).$$
From the definition of $a_i$, one has
$$\big[\nabla^2 L_n^*(\hat{\beta}_N) - \nabla^2 L_N(\hat{\beta}_N)\big](\hat{\beta}_n - \hat{\beta}_N) = \Big[\frac{1}{Nn}\sum_{i \in S}\frac{1}{\pi_i^*}\nabla^2 l(\hat{\beta}_N; Z_i^*) - \frac{1}{N}\sum_{i \in U}\nabla^2 l(\hat{\beta}_N; Z_i)\Big](\hat{\beta}_n - \hat{\beta}_N) = \sum_{i \in U}\frac{a_i - n\pi_i}{Nn\pi_i}\,\nabla^2 l(\hat{\beta}_N; Z_i)\,(\hat{\beta}_n - \hat{\beta}_N) = o_{P|\mathcal{F}_N}(1).$$
Combining Equations (A1), (A2) and (A5) with Equation (A4), one has
$$0 = \nabla L_n^*(\hat{\beta}_N) + \nabla^2 L_N(\hat{\beta}_N)(\hat{\beta}_n - \hat{\beta}_N) + o_{P|\mathcal{F}_N}(1).$$
This can be rearranged to
$$\hat{\beta}_n - \hat{\beta}_N = -\Sigma_N^{-1}\nabla L_n^*(\hat{\beta}_N) + o_{P|\mathcal{F}_N}(1).$$
The proposition is proved. □
Remark A1.
The last equation in the proof shows that $\hat{\beta}_n - \hat{\beta}_N + \Sigma_N^{-1}\nabla L_n^*(\hat{\beta}_N)$ converges to zero in probability conditional on $\mathcal{F}_N$; the notation $o_{P|\mathcal{F}_N}(1)$ in Equation (A6) denotes such a term, i.e., one that is $o(1)$ in probability conditional on $\mathcal{F}_N$.
Proof of Theorem 1.
For every constant $\hat{\gamma} > 0$, one has
$$\begin{aligned}
\sum_{j \in S}\mathbb{E}\big[\|n^{-1/2}\zeta_j^*\|^2\, I(\|\zeta_j^*\| > n^{1/2}\hat{\gamma}) \mid \mathcal{F}_N\big]
&\leq \sum_{j \in S}\mathbb{E}\Big[\frac{\|n^{-1/2}\zeta_j^*\|^2\,\|\zeta_j^*\|^2}{n\hat{\gamma}^2}\, I(\|\zeta_j^*\| > n^{1/2}\hat{\gamma}) \,\Big|\, \mathcal{F}_N\Big] \\
&= \frac{1}{n^2\hat{\gamma}^2}\sum_{j \in S}\mathbb{E}\big[\|\zeta_j^*\|^4\, I(\|\zeta_j^*\| > n^{1/2}\hat{\gamma}) \mid \mathcal{F}_N\big]
\leq \frac{1}{n^2\hat{\gamma}^2}\sum_{j \in S}\mathbb{E}\big[\|\zeta_j^*\|^4 \mid \mathcal{F}_N\big] \\
&= \frac{1}{n^2\hat{\gamma}^2}\, n\sum_{i \in U}\|\zeta_i\|^4\,\pi_i
= \frac{1}{n\hat{\gamma}^2}\sum_{i \in U}\frac{1}{N^4\pi_i^3}\,\big\|\Sigma_N^{-1}\nabla l(\hat{\beta}_N; Z_i)\big\|^4 \\
&\leq \frac{1}{n\hat{\gamma}^2}\,\frac{1}{N^4}\sum_{i \in U}\frac{1}{\pi_i^3}\,\frac{1}{\lambda^4}\,\big\|\nabla l(\hat{\beta}_N; Z_i)\big\|^4
= \frac{1}{n\hat{\gamma}^2}\,\frac{1}{\lambda^4}\,O_P(1) = o_P(1).
\end{aligned}$$
Furthermore,
$$\sum_{j \in S}\operatorname{Cov}\big(n^{-1/2}\zeta_j^* \mid \mathcal{F}_N\big) = \sum_{j \in S}\mathbb{E}\Big[\big(n^{-1/2}\zeta_j^* - \mathbb{E}(n^{-1/2}\zeta_j^* \mid \mathcal{F}_N)\big)\big(n^{-1/2}\zeta_j^* - \mathbb{E}(n^{-1/2}\zeta_j^* \mid \mathcal{F}_N)\big)^T \,\Big|\, \mathcal{F}_N\Big] = \sum_{j \in S}\mathbb{E}\big[(n^{-1/2}\zeta_j^*)(n^{-1/2}\zeta_j^*)^T \mid \mathcal{F}_N\big] = \frac{1}{n}\sum_{j \in S}\mathbb{E}\big(\zeta_j^*\zeta_j^{*T} \mid \mathcal{F}_N\big) = \frac{1}{n}\,n\,\mathbb{E}(\zeta\zeta^T \mid \mathcal{F}_N) = \operatorname{Var}(\zeta \mid \mathcal{F}_N).$$
Then, by the Lindeberg-Feller central limit theorem (Proposition 2.27 of [11]), conditional on $\mathcal{F}_N$,
$$\sum_{j \in S} n^{-1/2}\zeta_j^* \xrightarrow{d} N\big(0, \operatorname{Var}(\zeta \mid \mathcal{F}_N)\big).$$
Therefore, combining the above with Proposition A1, Equation (5) holds, and the proof is completed. □
Proof of Theorem 2.
Given that the existence of the estimators follows from [25], one next needs to show convexity (i.e., uniqueness and that the solution is a maximum). Let
I 1 ( β ) = E ( s n ( β ) ) = E i S 1 π i [ Y i μ ( ψ ( β T X i ) ) ] ψ ˙ ( β T X i ) X i β = E ( i S 1 π i μ ˙ ( ψ ( β T X i ) ) [ ψ ˙ ( β T X i ) ] 2 X i X i T + i S 1 π i [ Y i μ ( ψ ( β T X i ) ) ] ψ ¨ ( β T X i ) X i X i T ) = i S 1 π i μ ˙ ( ψ ( β T X i ) ) [ ψ ˙ ( β T X i ) ] 2 X i X i T ,
where
s n ( β ) = i S 1 π i Y i μ ψ β T X i ψ ˙ β T X i X i .
From Theorem 4.17 in [16], one needs to show
max γ G ( C 0 ) I 1 ( β ^ N ) 1 / 2 s n ( γ ) I 1 ( β ^ N ) 1 / 2 + I p P | F N 0 ,
where G ( C 0 ) = γ : I 1 ( β ^ N ) 1 / 2 ( γ β ^ N ) C 0 and I p = diag ( 1 , 1 , , 1 ) is a p-dimensional identity matrix.
Let
M n ( γ ) = i S 1 π i [ ψ ˙ ( γ T X i ) ] 2 b ¨ ( ψ ( γ T X i ) ) X i X i T
and
R n ( γ ) = i S 1 π i [ Y i μ ( ψ ( γ T X i ) ) ] ψ ¨ ( γ T X i ) X i X i T .
Then
s n ( γ ) = R n ( γ ) M n ( γ )
and
I 1 ( γ ) = E ( s n ( γ ) ) = M n ( γ )
Thus, one only needs to prove
max γ G ( C 0 ) M n ( β ^ N ) 1 / 2 M n ( γ ) M n ( β ^ N ) M n ( β ^ N ) 1 / 2 P | F N 0 ,
and
max γ G ( C 0 ) M n ( β ^ N ) 1 / 2 R n ( γ ) M n ( β ^ N ) 1 / 2 P | F N 0
for any C 0 > 0 . From the definition of M n ( γ ) , and the property of trace in P288 of [16], the left-hand side of Equation (A11) can be bounded by
p max γ G ( C 0 ) , i S 1 φ γ T X i / φ β ^ N T X i .
From condition (A.4), one needs to prove γ T X i β ^ N T X i converges to 0 so that Equation (A11) holds, and one has
γ T X i β ^ N T X i 2 = ( γ T β ^ N T ) I 1 ( β ^ N ) 1 / 2 I 1 ( β ^ N ) 1 / 2 X i 2 I 1 ( β ^ N ) 1 / 2 ( γ β ^ N ) 2 I 1 ( β ^ N ) 1 / 2 X i 2 C 0 2 max i S X i T I 1 ( β ^ N ) 1 X i = C 0 2 max i S X i T M n ( β ^ N ) 1 X i = C 0 2 max i S X i T i S 1 π i [ ψ ˙ ( ^ } β N T X i ) ] 2 b ¨ ( ψ ( ^ } β N T X i ) ) X i X i T 1 X i = C 0 2 max i S X i T i S 1 π i φ ( ^ } β N T X i ) X i X i T 1 X i = C 0 2 max i S X i T i S N 1 N π i φ ( ^ } β N T X i ) X i X i T 1 X i C 0 2 N min i S 1 N π i inf i S φ β ^ N T X i 1 max i S X i T i S X i X i T 1 X i = C 0 2 N min i S 1 N π i inf i S φ β ^ N T X i 1 max i S X i T X X T 1 X i P | F N 0 .
Hence Equation (A11) holds. Let e i = Y i μ ( ψ ( β ^ N T X i ) ) , and
U n ( γ ) = i S 1 π i μ ( ψ ( β ^ N T X i ) ) μ ψ γ T X i ψ ¨ γ T X i X i X i T , V n ( γ ) = i S e i π i ψ ¨ γ T X i ψ ¨ ( β ^ N T X i ) X i X i T , W n ( β ^ N ) = i S e i π i ψ ¨ ( β ^ N T X i ) X i X i T .
Then R n ( γ ) = U n ( γ ) + V n ( γ ) + W n ( β ^ N ) . In the same way as proving Equation (A11), we have
max γ G ( C 0 ) M n ( β ^ N ) 1 / 2 U n ( γ ) M n ( β ^ N ) 1 / 2 P | F N 0 .
Note that M n ( β ^ N ) 1 / 2 V n ( γ ) M n ( β ^ N ) 1 / 2 is bounded by the product of
M n ( β ^ N ) 1 2 i S e i π i X i X i T M n ( β ^ N ) 1 2
and
max γ G ( C 0 ) , i S ψ ¨ γ T X i ψ ¨ β ^ N T X i .
Equation (A13) can be bounded as
M n ( β ^ N ) 1 2 i S e i π i X i X i T M n ( β ^ N ) 1 2 = I 1 ( β ^ N ) 1 2 i S e i π i X i X i T I 1 ( β ^ N ) 1 2 i S e i π i [ I 1 ( β ^ N ) ] 1 2 X i 2 i S e i π i N min i S 1 N π i inf i S φ β ^ N T X i 1 max i S X i T X X T 1 X i i S e i max i S 1 N π i min i S 1 N π i inf i S φ β ^ N T X i 1 max i S X i T X X T 1 X i i U Y i μ ( ψ ( β ^ N T X i ) ) max i S 1 N π i min i S 1 N π i inf i S φ β ^ N T X i 1 max i S X i T X X T 1 X i i U sup β Θ Y i μ ( ψ ( β T X i ) ) max i S 1 N π i min i S 1 N π i inf i S φ β ^ N T X i 1 · max i S X i T X X T 1 X i = 1 n i S E sup β Θ 1 π i | Y i μ ( ψ ( β T X i ) ) | | F N max i S 1 N π i min i S 1 N π i inf i S φ β ^ N T X i 1 · max i S X i T X X T 1 X i = O P | F N ( 1 / n ) ,
where the penultimate equal sign applies the Lemma A2 with
l ( β ) = sup β Θ Y i μ ( ψ ( β T X i ) ) .
Equation (A14) can be bounded as
max γ G ( C 0 ) , i S ψ ¨ γ T X i ψ ¨ β ^ N T X i P | F N 0 ,
which can be proved by the same argument as for Equation (A11) using the Lagrange mean value theorem. Combining the bounds of Equations (A13) and (A14), one obtains
max γ G ( C 0 ) M n ( β ^ N ) 1 / 2 V n ( γ ) M n ( β ^ N ) 1 / 2 P | F N 0 .
Let δ ( 0 , 1 ) be a constant. Since sup i S E ( | e i | 1 + δ | F N ) < , one has
i S E e i π i ψ ¨ β ^ N T X i X i T M n ( β ^ N ) 1 X i 1 + δ | F N i S E e i π i 1 + δ | F N · max i S ψ ¨ β ^ N T X i 1 + δ · X i T M n ( β ^ N ) 1 X i 1 + δ i S 1 π i 1 + δ E e i 1 + δ | F N max i S ψ ¨ β ^ N T X i 1 + δ · X i T i S N 1 N π i φ ( ^ } β N T X i ) X i X i T 1 X i 1 + δ = i S 1 N π i 1 + δ E e i 1 + δ | F N max i S ψ ¨ β ^ N T X i 1 + δ · X i T i S 1 N π i φ ( ^ } β N T X i ) X i X i T 1 X i 1 + δ C δ i S X i T X X T 1 X i 1 + δ C δ i S X i T X X T 1 X i max i S X i T X X T 1 X i δ = C δ max i S X i T X X T 1 X i δ i S tr X i T X X T 1 X i = C δ max i S X i T X X T 1 X i δ i S tr X X T 1 X i X i T = C δ max i S X i T X X T 1 X i δ tr X X T 1 i S X i X i T = C δ max i S X i T X X T 1 X i δ tr X X T 1 X X T = C δ max i S X i T X X T 1 X i δ tr I p = p C δ max i S X i T X X T 1 X i δ P | F N 0 ,
where C δ > 0 is a constant. Under the definition of W n ( β ^ N ) and E ( e i | F N ) = 0 , together with Theorem 1.14(ii) in [16], one obtains
M n ( β ^ N ) 1 2 W n ( β ^ N ) M n ( β ^ N ) 1 2 P | F N 0 .
Hence, Equation (A12) holds and the proof is completed. □
Proof of Theorem 3.
According to the mean value theorem, one has
0 = s n ( β ^ n ) = s n ( β ^ N ) + s n ( β ¯ ¯ ) ( β ^ n β ^ N ) ,
where β ¯ ¯ is between β ^ n and β ^ N , then
n ( β ^ n β ^ N ) = n s n ( β ¯ ¯ ) 1 s n ( β ^ N ) .
Let q i ( β ^ N ) = 1 π i Y i μ ( ψ ( β ^ N T X i ) ) ψ ˙ ( β ^ N T X i ) X i , then
i S q i ( β ^ N ) = i S 1 π i Y i μ ( ψ ( β ^ N T X i ) ) ψ ˙ ( β ^ N T X i ) X i = s n ( β ^ N ) .
According to E ( Y i | F N ) = μ ( ψ ( β ^ N T X i ) ) in Equation (4), one obtains
E ( q i ( β ^ N ) | F N ) = 1 π i E ( Y i | F N ) μ ( ψ ( β ^ N T X i ) ) ψ ˙ ( β ^ N T X i ) X i = 0 .
Applying the Lindeberg–Lévy CLT, one has
s n ( β ^ N ) n d N ( 0 , Var ( q i ( β ^ N ) | F N ) ) ,
where
Var ( q i ( β ^ N ) | F N ) = E ( q i ( β ^ N ) q i ( β ^ N ) T | F N ) = i U a i π i Y i b ˙ ( ψ ( β ^ N T X i ) ) 2 ψ ˙ ( β ^ N T X i ) 2 X i X i T .
Applying Theorem 2 in [26], one has
s n ( β ¯ ¯ ) n = 1 n i S q i ( β ¯ ¯ ) β ¯ ¯ a . s . E q i ( β ¯ ¯ ) β ¯ ¯ | F N ,
where
E q i ( β ¯ ¯ ) β ¯ ¯ | F N = i U a i Y i b ˙ ( ψ ( β ¯ ¯ T X i ) ) ψ ¨ ( β ¯ ¯ T X i ) X i X i T i U a i b ¨ ( ψ ( β ¯ ¯ T X i ) ) ψ ˙ ( β ¯ ¯ T X i ) 2 X i X i T .
Since $\bar{\bar{\beta}}$ lies between $\hat{\beta}_n$ and $\hat{\beta}_N$, and $\hat{\beta}_n$ is consistent with $\hat{\beta}_N$ conditional on $\mathcal{F}_N$ in probability, then
s n ( β ¯ ¯ ) n P | F N E q i ( β ^ N ) β ^ N | F N ,
where
E q i ( β ^ N ) β ^ N | F N = i U a i Y i b ˙ ( ψ ( β ^ N T X i ) ) ψ ¨ ( β ^ N T X i ) X i X i T i U a i b ¨ ( ψ ( β ^ N T X i ) ) ψ ˙ ( β ^ N T X i ) 2 X i X i T .
Finally, combining Equations (A15)–(A17) by Slutsky's theorem, one obtains
n ( β ^ n β ^ N ) d N ( 0 , V s ) ,
where V s = E q i ( β ^ N ) β ^ N | F N 1 Var ( q i ( β ^ N ) | F N ) E q i ( β ^ N ) β ^ N | F N 1 = Σ N 1 V N Σ N 1 . The proof is completed. □
Proof of Theorem 4.
Given the existence of $\hat{\beta}_n$ (see [27]), one needs to prove the consistency of $\hat{\beta}_n$ with respect to $\beta_0$.
Denote p β ( X , y ) : = exp { y ψ ( β T X ) b ( ψ ( β T X ) ) } , m β ( X , y ) = log p β ( X , y ) : = y ψ ( β T X ) b ( ψ ( β T X ) ) and φ ˜ ( β T X ) = b ˙ [ ψ ( β T X ) ] ψ ˙ ( β T X ) . Then the negative K-L divergence in [28] is bounded,
D K L ( P β 0 | | P β ) : = E β 0 ( m β m β 0 ) = E { ( E β 0 y | X ) [ ψ ( β T X ) ψ ( β 0 T X ) ] b ( ψ ( β T X ) ) + b ( ψ ( β 0 T X ) ) } = E { b ˙ [ ψ ( β 0 T X ) ] [ ψ ( β T X ) ψ ( β 0 T X ) ] b ( ψ ( β T X ) ) + b ( ψ ( β 0 T X ) ) } ( t 1 [ 0 , 1 ] ) = E { b ˙ [ ψ ( β 0 T X ) ] [ ψ ( β T X ) ψ ( β 0 T X ) ] b ˙ [ ( 1 t 1 ) ψ ( β T X ) + t 1 ψ ( β 0 T X ) ] [ ψ ( β T X ) ψ ( β 0 T X ) ] } ( t 2 [ 0 , 1 ] ) = E { b ¨ [ ( 1 t 2 ) ψ ( β 0 T X ) + ( 1 t 1 ) t 2 ψ ( β T X ) + t 1 t 2 ψ ( β 0 T X ) ] · [ ψ ( β 0 T X ) ( 1 t 1 ) ψ ( β T X ) t 1 ψ ( β 0 T X ) ] [ ψ ( β T X ) ψ ( β 0 T X ) ] } = ( 1 t 1 ) E { b ¨ [ ( 1 t 3 ) ψ ( β 0 T X ) + t 3 ψ ( β T X ) ] [ ψ ( β T X ) ψ ( β 0 T X ) ] 2 } ( t 4 [ 0 , 1 ] ) = ( 1 t 1 ) E { b ¨ [ ( 1 t 3 ) ψ ( β 0 T X ) + t 3 ψ ( β T X ) ] · ψ ˙ [ ( 1 t 4 ) ( β 0 T X ) + t 4 ( β T X ) ] 2 ( β T X β 0 T X ) 2 } By ( B . 4 ) and ( B . 5 ) ] ( 1 t 1 ) C 1 E ( β T X β 0 T X ) 2 = ( 1 t 1 ) C 1 ( β β 0 ) T ( E X X T ) ( β β 0 ) ( 1 t 1 ) C 1 λ min ( E X X T ) | | β β 0 | | 2 By ( B . 1 ) ] ( 1 t 1 ) C 1 C 2 | | β β 0 | | 2 ,
where $t_3 = t_2 - t_1 t_2 \in [0, 1]$ and $C_2 > 0$. Then for any $\varepsilon > 0$, one has the well-separation condition
sup | | β β 0 | | 2 ε E β 0 m β ( X , y ) < E β 0 m β 0 ( X , y ) .
Let $\tilde{M}_n(\beta) := \frac{1}{n}\sum_{i=1}^n m_\beta(X_i, Y_i)$, which is essentially the log-likelihood of the subsampled GLM, and $\hat{\beta}_n$ is its maximizer. Thus, one has the near-maximization property $\tilde{M}_n(\hat{\beta}_n) \geq \tilde{M}_n(\beta_0) \geq \tilde{M}_n(\beta_0) - o_P(1)$.
Let F : = { m β ( X , y ) = y ψ ( β T X ) + b ( ψ ( β T X ) ) , β Θ } . Now one obtains
| m β 1 ( X , y ) m β 2 ( X , y ) | = | y ψ ( β 1 T X ) + b ( ψ ( β 1 T X ) ) + y ψ ( β 2 T X ) b ( ψ ( β 2 T X ) ) | = | y ψ ( β 1 T X ) b ( ψ ( β 1 T X ) ) y ψ ( β 2 T X ) + b ( ψ ( β 2 T X ) ) | = | | y ψ ˙ ( ξ ( 5 ) T X ) ( β 1 T X β 2 T X ) X b ˙ ( ψ ( ξ ( 6 ) T X ) ) ψ ˙ ( ξ ( 6 ) T X ) ( β 1 T X β 2 T X ) X | | C 4 | y b ˙ ( ψ ( ξ ( 6 ) T X ) ) | · | β 1 T X β 2 T X | · | | X | | C 4 | y b ˙ ( ψ ( ξ ( 6 ) T X ) ) | · | | X | | 2 · | | β 1 β 2 | | , β 1 , β 2 Θ ,
where ξ ( 5 ) and ξ ( 6 ) are both between β 1 and β 2 and C 4 > 0 .
Let m ¯ ( X , y ) = | y b ˙ ( ψ ( ξ ( 6 ) T X ) ) | · | | X | | 2 and by (B.3), one has
| | m ¯ ( X , y ) | | P , 1 : = E β 0 | m ¯ ( X , Y ) | E β 0 sup β Θ [ | y b ˙ ( ψ ( β T X ) ) | · | | X | | 2 ] < ,
where | | · | | P ˜ , 1 = P ˜ | · | is the L 1 ( P ˜ ) -norm in P269-P270 of [11] and P ˜ : = E β 0 . And then from the Example 19.7 in [11], one obtains
N [ ] ε , F , L 1 ( E β 0 ) K diam Θ ε / | | m ¯ | | E β 0 , 1 p < , every 0 < ε < diam Θ <
where N [ ] ε , F , L 1 ( E β 0 ) is called bracketing number which is the minimum number of ε -brackets needed to cover F ; see P270 in [11]. And K is a constant, and diam Θ = sup β 1 , β 2 Θ | | β 1 β 2 | | .
Therefore, the class F is P-Glivenko-Cantelli by Theorem 19.4 in [11]. And from the definition of P-Glivenko-Cantelli in P269 of [11], we have
sup β Θ | M ˜ n ( β ) E β 0 m β ( X , y ) | a . s . 0 .
Finally, according to Theorem 5.7 in [11], we get β ^ n β 0 = o P ( 1 ) . The proof is then completed. □
Recalling (A7) and (A8), we have $\nabla s_n(\gamma) = R_n(\gamma) - M_n(\gamma)$. Let $\Phi = \mathbb{E}(\nabla s_n(\beta))$; then we have the following lemma.
Lemma A3.
For β R p , assume that
(E.1) 
  R n ( β ) is finite and nonsingular.
(E.2) 
For 1 k , j p ,
E i S a i π i [ Y i μ ( ψ ( β T X i ) ) ] ψ ¨ ( β T X i ) x i k x i j 2 = o ( 1 ) .
(E.3) 
For 1 k , j p ,
Var i U a i π i [ ψ ˙ ( β T X i ) ] 2 μ ˙ ( ψ ( β T X i ) ) x i k x i j = o ( 1 ) .
Then,
s n ( β ) Φ .
Proof. 
One derives each entry in the matrix by
( s n ( β ) ) k j = ( R n ( β ) ) k j ( M n ( β ) ) k j = i S 1 π i [ Y i μ ( ψ ( β T X i ) ) ] ψ ¨ ( β T X i ) x i k x i j i S 1 π i [ ψ ˙ ( β T X i ) ] 2 μ ˙ ( ψ ( β T X i ) ) x i k x i j .
By the definition of Φ , one has
Φ k j = E ( s n ( β ) ) k j = E i S 1 π i μ ˙ ( ψ ( β T X i ) ) [ ψ ˙ ( β T X i ) ] 2 x i k x i j .
Next, one obtains
E ( s n ( β ) ) k j Φ k j 2 = E ( s n ( β ) ) k j Φ k j 2 | ( X i , Y i ) i = 1 N = E ( s n ( β ) ) k j E ( s n ( β ) ) k j 2 | ( X i , Y i ) i = 1 N = E ( R n ( β ) ) k j ( M n ( β ) ) k j + E ( M n ( β ) ) k j 2 | ( X i , Y i ) i = 1 N = E ( R n ( β ) ) k j 2 + E ( M n ( β ) ) k j ( M n ( β ) ) k j 2 | ( X i , Y i ) i = 1 N + E 2 ( R n ( β ) ) k j E ( M n ( β ) ) k j ( M n ( β ) ) k j | ( X i , Y i ) i = 1 N = E ( R n ( β ) ) k j 2 | ( X i , Y i ) i = 1 N + Var ( M n ( β ) ) k j | ( X i , Y i ) i = 1 N = o ( 1 ) ,
where the first equality is based on the fact that, conditional on the N data points, the n repeated sampling steps are independent and identically distributed. The last equality holds by conditions (E.2) and (E.3). □
Lemma A4.
Under the conditions (C.1)–(C.5) in Theorem 5, if s n ( β ^ n ) = 0 for all large n and | | β ^ n β 0 | | = O P ( 1 / N ) , then
s n ( β 0 ) = Φ ( β ^ n β 0 ) + o P ( 1 ) .
Proof. 
By Taylor’s expansion:
0 = s n ( β ^ n ) = s n ( β 0 ) + s n ( β 0 ) ( β ^ n β 0 ) + 1 2 ( β ^ n β 0 ) T Σ ( β ˜ n ) ( β ^ n β 0 ) ,
where Σ ( β ˜ n ) = 2 s n ( β ˜ n ) and β ˜ n is between β 0 and β ^ n . From assumption (C.3), (C.4) and (C.5) in Theorem 5, we have
Σ ( β ˜ n ) = i S 1 π i ϕ ¨ β ( X i , Y i ) i S 1 π i · ϕ ¨ β ( X i , Y i ) = O ( n N ) .
Then 1 2 ( β ^ n β 0 ) T Σ ( β ˜ n ) ( β ^ n β 0 ) = o P ( 1 ) . Therefore, by Lemma A3, one has
0 = s n ( β 0 ) + ( Φ + o ( 1 ) ) ( β ^ n β 0 ) + o P ( 1 ) ,
which implies
s n ( β 0 ) = Φ ( β ^ n β 0 ) + o P ( 1 ) .
Hence, the proof is completed. □
Lemma A5.
$\{\bar{M}_i\}_{i=1}^n$ is a martingale difference sequence adapted to the filtration $\{\mathcal{F}_{N,i}\}_{i=1}^n$.
Proof. 
The M ¯ i ’s are F N , i -measurable by the definition of M ¯ i and the definition of the filtration { F N , i } i = 1 n . Then we obtain
E [ M ¯ i | F N , i 1 ] = E 1 π i ϕ β ( X i , Y i ) j = 1 N ϕ β ( X j , Y j ) | F N , i 1 = E 1 π i ϕ β ( X i , Y i ) | F N , i 1 E j = 1 N ϕ β ( X j , Y j ) | F N , i 1 = i = 1 N π i 1 π i ϕ β ( X i , Y i ) i = 1 N π i i = 1 N π i j = 1 N ϕ β ( X j , Y j ) i = 1 N π i = i = 1 N ϕ β ( X i , Y i ) j = 1 N ϕ β ( X j , Y j ) = 0 .
By the definition of martingale difference sequence in P230 of [29], the proof is completed. □
Under the definitions of $T$, $\bar{M}$ and $Q$, we have $\operatorname{Var}(T) = \operatorname{Var}(\bar{M}) + \operatorname{Var}(Q)$: $Q$ is $\mathcal{F}_{N,0}$-measurable and $\mathbb{E}[\bar{M} \mid \mathcal{F}_{N,0}] = 0$, so $\bar{M}$ and $Q$ are uncorrelated.
Lemma A6.
sup N λ max ( B N ) 1 .
Proof. 
By the symmetry of $B_N$, we only need to show that, for any $N$, $I - B_N$ is positive definite.
I B N = Var ( T ) 1 2 ( Var ( T ) Var ( M ¯ ) ) Var ( T ) 1 2 = Var ( T ) 1 2 Var ( Q ) Var ( T ) 1 2 .
Therefore, $I - B_N$ is congruent to the positive definite matrix $\operatorname{Var}(Q)$. The proof is completed. □
Lemma A7
(Multivariate version of martingale CLT, Lemma 4 in [19]). For k = 1 , 2 , 3 , , let { ξ k i ; i = 1 , 2 , , N k } be a martingale difference sequence in R p relative to the filtration { F k i ; i = 0 , 1 , , N k } and let Y k R p be an F k 0 -measurable random vector. Set S k = i = 1 N k ξ k i . Assume that
(F.1) 
lim k i = 1 N k E [ ξ k i 4 ] = 0 ;
(F.2) 
lim k E i = 1 N k [ ξ k i ξ k i T | F k , i 1 ] B k 2 = 0 for some sequence of positive definite matrices { B k } k = 1 with sup k λ m a x ( B k ) < i.e., the largest eigenvalue is uniformly bounded;
(F.3) 
For some probability distribution L 0 , ∗ denotes convolution and L ( · ) denotes the law of random variates:
L ( Y k ) N ( 0 , B k ) d L 0 .
Then
L ( Y k + S k ) d L 0 .
Lemma A8
(Asymptotic normality of s n ( β 0 ) ). Assume that
(G.1.) 
lim N i = 1 n E [ | | ξ N i | | 4 ] = 0 ;
(G.2.) 
lim N E i = 1 n E [ ξ N i ξ N i T | F N , i 1 ] B N 2 = 0 .
Then
Var ( T ) 1 2 · T d N ( 0 , I p ) .
Proof. 
The quantities in Lemma A7 are instantiated as
ξ k i = ξ N i , Y k = Var ( T ) 1 2 · Q , B k = B N , L 0 N ( 0 , I p ) .
By Lemma A5 and conditions (G.1) and (G.2), the martingale difference structure and conditions (F.1) and (F.2) of Lemma A7 are satisfied. Next, we only need to show that the third condition in Lemma A7 holds. By the central limit theorem, we have
Var ( Q ) 1 2 · Q d N ( 0 , I p ) .
For any $t \in \mathbb{R}^p$, let $\tilde{t} = \operatorname{Var}(T)^{-1/2} t$ and $\tilde{X} = iQ$. Since the relevant properties of the complex multivariate normal distribution are equivalent to those of the real multivariate normal distribution (p. 222 of [30]), and $\mathbb{E}Q = 0$, one has
Var ( X ˜ ) = Var ( i Q ) = E [ ( i Q ) ( i Q ) ] [ E ( i Q ) ] 2 = E Q 2 = E Q 2 + ( E Q ) 2 = Var ( Q ) .
Thus, according to Equations (45.4)–(45.6) in P108 of [30], one has
E e t ˜ T X ˜ = e t ˜ T E ( X ˜ ) + 1 2 t ˜ T Var ( X ˜ ) t ˜ = e 1 2 t T Var ( T ) 1 2 Var ( Q ) Var ( T ) 1 2 t .
Further, we obtain
E [ e i t T Var ( T ) 1 2 Q ] · e 1 2 t T Var ( T ) 1 2 Var ( M ¯ ) Var ( T ) 1 2 t = e 1 2 t T t .
Therefore, condition (F.3) in Lemma A7 is verified. Then one obtains
Var ( T ) 1 2 T = Var ( T ) 1 2 · Q + Var ( T ) 1 2 · M ¯ d N ( 0 , I p ) .
The proof is completed. □
Proof of Theorem 5.
According to Lemma A4,
Φ ( β ^ n β 0 ) + o P ( 1 ) = s n ( β 0 ) = T .
Multiplying (A18) by $\operatorname{Var}(T)^{-1/2}$, one obtains
Var ( T ) 1 2 Φ ( β ^ n β 0 ) + o P ( | | Var ( T ) 1 2 | | ) = Var ( T ) 1 2 T .
Applying Lemma A8, one obtains
Var ( T ) 1 2 Φ ( β ^ n β 0 ) d N ( 0 , I p ) .
The proof is completed. □

References

1. Xi, R.; Lin, N. Direct regression modelling of high-order moments in big data. Stat. Its Interface 2016, 9, 445–452.
2. Tewes, J.; Politis, D.N.; Nordman, D.J. Convolved subsampling estimation with applications to block bootstrap. Ann. Stat. 2019, 47, 468–496.
3. Yu, J.; Wang, H.; Ai, M.; Zhang, H. Optimal distributed subsampling for maximum quasi-likelihood estimators with massive data. J. Am. Stat. Assoc. 2022, 117, 265–276.
4. Yao, Y.; Wang, H. A review on optimal subsampling methods for massive datasets. J. Data Sci. 2021, 19, 151–172.
5. Yu, J.; Wang, H. Subdata selection algorithm for linear model discrimination. Stat. Pap. 2021, 63, 1883–1906.
6. Fu, S.; Chen, P.; Liu, Y.; Ye, Z. Simplex-based Multinomial Logistic Regression with Diverging Numbers of Categories and Covariates. Stat. Sin. 2022, in press.
7. Ma, J.; Xu, J.; Maleki, A. Analysis of sensing spectral for signal recovery under a generalized linear model. Adv. Neural Inf. Process. Syst. 2021, 34, 22601–22613.
8. Mahmood, T. Generalized linear model based monitoring methods for high-yield processes. Qual. Reliab. Eng. Int. 2020, 36, 1570–1591.
9. Ai, M.; Yu, J.; Zhang, H.; Wang, H. Optimal Subsampling Algorithms for Big Data Regressions. Stat. Sin. 2021, 31, 749–772.
10. Wang, H.; Zhu, R.; Ma, P. Optimal subsampling for large sample logistic regression. J. Am. Stat. Assoc. 2018, 113, 829–844.
11. van der Vaart, A.W. Asymptotic Statistics; Cambridge University Press: London, UK, 1998.
12. Wooldridge, J.M. Inverse probability weighted M-estimators for sample selection, attrition, and stratification. Port. Econ. J. 2002, 1, 117–139.
13. Durrett, R. Probability: Theory and Examples, 5th ed.; Cambridge University Press: Cambridge, UK, 2019.
14. McCullagh, P.; Nelder, J. Generalized Linear Models, 2nd ed.; Chapman and Hall/CRC: Boca Raton, FL, USA, 1989.
15. Fahrmeir, L.; Kaufmann, H. Consistency and asymptotic normality of the maximum likelihood estimator in generalized linear models. Ann. Stat. 1985, 13, 342–368.
16. Shao, J. Mathematical Statistics, 2nd ed.; Springer: New York, NY, USA, 2003.
17. Yin, C.; Zhao, L.; Wei, C. Asymptotic normality and strong consistency of maximum quasi-likelihood estimates in generalized linear models. Sci. China Ser. A 2006, 49, 145–157.
18. Rigollet, P. Kullback-Leibler aggregation and misspecified generalized linear models. Ann. Stat. 2012, 40, 639–665.
19. Zhang, T.; Ning, Y.; Ruppert, D. Optimal sampling for generalized linear models under measurement constraints. J. Comput. Graph. Stat. 2021, 30, 106–114.
20. Ohlsson, E. Asymptotic normality for two-stage sampling from a finite population. Probab. Theory Relat. Fields 1989, 81, 341–352.
21. Zhang, H.; Wei, H. Sharper Sub-Weibull Concentrations. Mathematics 2022, 10, 2252.
22. Gong, T.; Dong, Y.; Chen, H.; Dong, B.; Li, C. Markov Subsampling Based on Huber Criterion. IEEE Trans. Neural Netw. Learn. Syst. 2022, in press.
23. Xiao, Y.; Yan, T.; Zhang, H.; Zhang, Y. Oracle inequalities for weighted group lasso in high-dimensional misspecified Cox models. J. Inequalities Appl. 2020, 2020, 252.
24. Zhang, H.; Jia, J. Elastic-net regularized high-dimensional negative binomial regression: Consistency and weak signals detection. Stat. Sin. 2022, 32, 181–207.
25. Ding, J.L.; Chen, X.R. Large-sample theory for generalized linear models with non-natural link and random variates. Acta Math. Appl. Sin. 2006, 22, 115–126.
26. Jennrich, R.I. Asymptotic properties of non-linear least squares estimators. Ann. Math. Stat. 1969, 40, 633–643.
27. White, H. Maximum likelihood estimation of misspecified models. Econom. J. Econom. Soc. 1982, 50, 1–25.
28. Kullback, S.; Leibler, R.A. On information and sufficiency. Ann. Math. Stat. 1951, 22, 79–86.
29. Davidson, J. Stochastic Limit Theory: An Introduction for Econometricians; OUP Oxford: Oxford, UK, 1994.
30. Kotz, S.; Balakrishnan, N.; Johnson, N.L. Continuous Multivariate Distributions, Volume 1: Models and Applications, 2nd ed.; John Wiley & Sons: Hoboken, NJ, USA, 2000.