1. Introduction
Goodness-of-fit tests are crucial in statistical inference: they allow researchers to assess whether a proposed distribution accurately reflects the observed data. These fundamental goodness-of-fit problems have a long history. The Cramér–von Mises test [1,2] provides a useful criterion for continuous distributions. For independent and identically distributed random variables $X_1, \ldots, X_n$ from a distribution with cumulative distribution function $F$, to test
$$H_0: F = F_0 \quad \text{versus} \quad H_1: F \neq F_0, \tag{1}$$
where $F_0$ is pre-specified, the Cramér–von Mises test statistic uses a squared distance between the empirical distribution function $F_n(x) = n^{-1} \sum_{i=1}^{n} \mathbf{1}\{X_i \le x\}$ and the given distribution $F_0$:
$$\omega_n^2 = n \int_{-\infty}^{\infty} \{F_n(x) - F_0(x)\}^2 \, \mathrm{d}F_0(x), \tag{2}$$
which is distribution-free if $F_0$ is continuous [3]. With the transformed random variables $U_i = F_0(X_i)$, $1 \le i \le n$, taking values in the range of $F_0$, one can equivalently test whether $U_1, \ldots, U_n$ follow the uniform $[0,1]$ distribution. For the order statistics $U_{(1)} \le \cdots \le U_{(n)}$, the Cramér–von Mises statistic can also be written as
$$\omega_n^2 = \frac{1}{12n} + \sum_{i=1}^{n} \left( U_{(i)} - \frac{2i-1}{2n} \right)^2.$$
It has been well established that the empirical process of $U_1, \ldots, U_n$ converges weakly to a Gaussian process with covariance function $\min(u, v) - uv$ under the null hypothesis. As a consequence, it holds that
$$\omega_n^2 \Rightarrow W := \sum_{k=1}^{\infty} \frac{\chi_{1,k}^2}{k^2 \pi^2}, \tag{3}$$
where $\Rightarrow$ denotes convergence in distribution and the $\chi_{1,k}^2$'s are independent $\chi_1^2$ random variables (see Chapter 5 in [4]). Some variants of (2) with the incorporation of different weight functions have been further discussed (see, for example, [5,6], among others).
The information era has witnessed an explosion in the collection of high-dimensional data across a wide range of areas, where the number of dimensions can be comparable to or even much larger than the sample size. Model fitting and distribution verification represent a basic step in statistical inference for high-dimensional data [7,8]. Some progress has been made in the testing of high-dimensional multinomials for discrete data (see, for example, [9,10]). For continuous distributions, the goodness-of-fit test is much less understood in the high-dimensional case (see Zhang and Wu [11] for a Kolmogorov–Smirnov-type test and Liang et al. [12] for the special case of testing for high-dimensional normality).
In multivariate cases with fixed dimensions, Cramér–von Mises tests have also been investigated in a considerable body of literature. Rosenblatt [13] considered the transformation of writing the joint distribution into products of conditional distributions to facilitate Kolmogorov–Smirnov and Cramér–von Mises tests. The Cramér–von Mises test for independence has been studied by Blum et al. [14], Cotterill and Csörgő [15], Cotterill et al. [16], Genest and Rémillard [17], and Genest et al. [18], among others.
However, technical challenges arise when the dimension $d$ is large compared to the sample size $n$. Specifically, Portnoy [19] showed that the multivariate central limit theorem is generally not valid when $d$ grows too quickly with $n$. Heuristic arguments and classical fixed-dimension theory are inadequate in the high-dimensional case. Nevertheless, interestingly, we will show that a result similar to (3) can still be valid by using new technical tools to study the distribution of quadratic functions of high-dimensional stochastic processes.
An important direction in fitting and testing multivariate distributions is based on copula modeling, which has gained remarkable popularity in the last decade. Inspired by Sklar's representation [20], one can decompose inference on a multivariate distribution into the modeling of the marginal distributions and of the copula (see Genest and Nešlehová [21] and Joe [22] for more on copulas). A variety of tests have been developed to test the dependence structures implied by given copulas (see [23] for a survey and implementation). The majority of these tests focus on cases with a small number of dimensions. Hering and Hofert [24] considered testing Archimedean copulas in a high-dimensional setting. While copula modeling has received increasing attention, a simultaneous goodness-of-fit test on the margins in the high-dimensional setting remains unsolved. In this paper, we provide another solution to the goodness-of-fit problem by introducing Cramér–von Mises-type criteria, which can be used to test marginal distributions and copulas for multivariate continuous data, allowing the dimension $d$ to grow with the sample size $n$. Our focus is on the marginal tests.
One primary goal of the paper is to rigorously establish an asymptotic theory of the Cramér–von Mises-type test statistics for marginal distributions and copulas when the dimension is large. We show that the limiting distribution of the Cramér–von Mises-type test statistics can be written as a sum of weighted $\chi_1^2$ random variables, where the weights depend on the eigenvalues of the covariance function of the associated high-dimensional linear operators. In our asymptotic relation, we let $n \to \infty$, and we can allow for both a bounded dimension $d$ and an increasing dimension $d \to \infty$.
As another major contribution, two different procedures are proposed to estimate the limiting distribution of the proposed Cramér–von Mises test statistic: a plug-in calibration method and a subsampling method. The former estimates the eigenvalues of the covariance function to approximate the distribution of a linear combination of chi-squared distributions, the validity of which is guaranteed under normalized consistency, a new matrix convergence criterion. The latter can avoid estimating the eigenvalues by drawing inference from a large number of subsamples of the data.
This paper is organized as follows. In Section 2, we introduce the modified Cramér–von Mises statistics to test marginal distributions and copulas and develop the distributional approximation theory. In Section 3, we introduce two different procedures to implement the high-dimensional Cramér–von Mises test in practice and provide theoretical justification of the validity of both methods. In Section 4.1, we conduct a simulation study to evaluate the finite-sample performance of the two approaches proposed in Section 3. All the proofs are provided in Appendix A.
We introduce some notation. For any real-valued square matrix $A$, let $\mathrm{tr}(A)$ be the trace of $A$, $\|A\|$ be the spectral norm, and $\|A\|_F$ be the Frobenius norm. For a random variable $X$ and some constant $q \ge 1$, we define $\|X\|_q = (\mathbb{E}|X|^q)^{1/q}$. If $(Z_n)$ is a sequence of random variables and $(a_n)$ is a sequence of values such that $Z_n/a_n \to 0$ in probability as $n \to \infty$, we write $Z_n = o_{\mathbb{P}}(a_n)$. If, for any $\epsilon > 0$, there exist finite $M$ and $N$ such that $\mathbb{P}(|Z_n/a_n| > M) < \epsilon$ for any $n > N$, we write $Z_n = O_{\mathbb{P}}(a_n)$. For two sequences of positive numbers $(a_n)$ and $(b_n)$, we write $a_n \asymp b_n$ if there exist two positive constants $c_1$ and $c_2$ such that $c_1 b_n \le a_n \le c_2 b_n$ for all large $n$.
2. High-Dimensional Cramér–von Mises Test
In this section, we address the Cramér–von Mises-type test for high-dimensional data from a jointly continuous distribution. We first consider the testing of marginal distributions; without loss of generality, we test the uniform $[0,1]$ distribution for each margin. Let $X_i = (X_{i1}, \ldots, X_{id})^\top$, $1 \le i \le n$, be independent and identically distributed random vectors of dimension $d$. We aim to test
$$H_0: X_{1j} \text{ follows the uniform } [0,1] \text{ distribution for all } 1 \le j \le d. \tag{4}$$
Here, we can allow any dependence structure on the $d$ components of $X_i$. To account for the dimensionality, our Cramér–von Mises-type test statistic is defined by
$$T_n = \frac{1}{n} \sum_{j=1}^{d} \sum_{1 \le i \neq i' \le n} \int_0^1 \big( \mathbf{1}\{X_{ij} \le u\} - u \big) \big( \mathbf{1}\{X_{i'j} \le u\} - u \big) \, \mathrm{d}u. \tag{5}$$
Note that for $d = 1$, $T_n$ is a modified version of (2) with the diagonal elements ($i = i'$) removed. The computation of $T_n$ can be significantly simplified in view of the fact that, for $a, b \in [0,1]$,
$$\int_0^1 \big( \mathbf{1}\{a \le u\} - u \big) \big( \mathbf{1}\{b \le u\} - u \big) \, \mathrm{d}u = \frac{1}{3} - \max(a, b) + \frac{a^2 + b^2}{2}.$$
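A minimal R sketch of this computation follows, using the diagonal-removed form of $T_n$ and the closed-form kernel displayed above; both displays are reconstructions from the surrounding text, so the sketch should be checked against the original definitions.

```r
# Closed-form kernel: integral of (1{a<=u} - u)(1{b<=u} - u) over u in [0, 1].
cvm_kernel <- function(a, b) 1 / 3 - pmax(a, b) + (a^2 + b^2) / 2

# Diagonal-removed Cramer-von Mises-type statistic T_n for an n x d matrix X
# whose entries are hypothesized to be marginally uniform on [0, 1].
T_n <- function(X) {
  n <- nrow(X)
  total <- 0
  for (j in seq_len(ncol(X))) {
    K <- outer(X[, j], X[, j], cvm_kernel)  # n x n kernel matrix for margin j
    total <- total + sum(K) - sum(diag(K))  # drop the diagonal terms i = i'
  }
  total / n
}

set.seed(1)
X <- matrix(runif(200 * 50), nrow = 200)    # n = 200, d = 50 under the null
T_n(X)
```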
Let $g(x, u) = \mathbf{1}\{x \le u\} - u$ for $x, u \in [0, 1]$. For $(u, j), (v, j') \in [0,1] \times \{1, \ldots, d\}$, let the covariance function be $K\{(u, j), (v, j')\} = \mathrm{Cov}\big( g(X_{1j}, u), g(X_{1j'}, v) \big)$. We define the linear operator $\mathcal{K}$ as
$$(\mathcal{K} f)(u, j) = \sum_{j'=1}^{d} \int_0^1 K\{(u, j), (v, j')\} f(v, j') \, \mathrm{d}v. \tag{6}$$
According to Mercer's theorem [25], there exist countably many eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge 0$ and eigenfunctions $\phi_k$, $k \ge 1$, such that $K\{(u, j), (v, j')\} = \sum_{k \ge 1} \lambda_k \phi_k(u, j) \phi_k(v, j')$.
Theorem 1. Assume that $H_0$ in (4) is true. Let $\lambda_1 \ge \lambda_2 \ge \cdots \ge 0$ be the eigenvalues of the linear operator $\mathcal{K}$ in (6). We define
$$W = \sum_{k=1}^{\infty} \lambda_k (\chi_{1,k}^2 - 1),$$
where $\chi_{1,k}^2$, $k \ge 1$, represents independent $\chi_1^2$ random variables. Assume that, for some $\delta > 0$, condition (7) holds. Then, there exists some constant $C > 0$ such that the distributional approximation (8) between $T_n$ and $W$ holds.

Remark 1. In the special case of $d = 1$, condition (7) trivially holds when $H_0$ is true. In this case, the covariance function is $K(u, v) = \min(u, v) - uv$, $u, v \in [0, 1]$, and the linear operator $\mathcal{K}$ has eigenvalues $\lambda_k = 1/(k^2 \pi^2)$ and eigenfunctions $\phi_k(u) = \sqrt{2} \sin(k \pi u)$, $k \ge 1$ (see Chapter 5 in [4] for arguments). In this case, our result (8) agrees with the classical Cramér–von Mises criterion.

Remark 2. We now discuss the condition of the above theorem in multivariate scenarios and the convergence rate in the result (8). If the $d$ components are independent and identically distributed, then the required moment bounds can be obtained by applying Rosenthal's inequality [26]. In this case, elementary calculations imply that (7) holds for an arbitrary $\delta$ under a minimal growth condition on $n$, and the convergence rate in (8) is sharper for a larger $\delta$. More generally, if the components form a stable stationary process in the sense that its functional dependence measures are summable [27], then the same moment bounds hold and (7) is again satisfied. If the components are strongly dependent, the minimal condition also suffices for (7), and a larger value of $\delta$ is still favored to sharpen the rate. More generally, allowing any dependence structure among the components, condition (7) is the restriction we should impose on the orders of $d$ and $n$ for the validity of the theorem. As a special case, if we further assume that $d$ and $n$ satisfy a polynomial growth relation with exponent determined by $\delta$, then the rate in (8) is polynomial in $n$, with an order depending on how the exponent changes with $\delta$.

Remark 3. Theorem 1 above establishes a distributional approximation result for the diagonal-removed statistic $T_n$. We now consider adding the diagonals back and investigating the full version of the statistic, which retains the terms with $i = i'$ in (5). Under $H_0$, the diagonal terms concentrate around their mean, and the corresponding distributional approximation result, namely (9), holds under (7). In this case, the limiting distribution of the diagonal-included statistic is again a sum of weighted $\chi_1^2$ random variables. Therefore, in the low-dimensional scenario where $d$ is fixed and (7) is satisfied, it is asymptotically equivalent to use either the diagonal-removed statistic or the diagonal-included statistic. Both statistics have simple closed forms, allowing for straightforward computation based on the kernel identity above, and share a similar type of limiting distribution, namely a linear combination of chi-squared random variables. On the other hand, in the high-dimensional setting where $d$ is allowed to grow with $n$, the diagonal sum may be of the same order as, or much larger than, $n$. In this case, according to the central limit theorem, if the Lindeberg condition holds for the diagonal terms, the appropriately centered and scaled diagonal-included statistic satisfies the normal convergence in (10). Thus, (9) and (10) illustrate an interesting dichotomy in high-dimensional settings: the statistic with diagonals included can exhibit two different asymptotic distributions depending on the magnitudes of $n$ and $d$. This divergence in asymptotic behavior across different regimes presents practical challenges in determining the appropriate relationship between $n$ and $d$ and, consequently, which type of asymptotic distribution should be applied. Relying on a subjective choice of asymptotic distribution could lead to unreliable conclusions.
In contrast, using the diagonal-removed statistic $T_n$ offers the advantage of avoiding this dichotomy.

Testing for copulas can be addressed in a similar way. Assume that $X_1$ has a jointly continuous copula $C$, with each marginal distribution being uniform $[0,1]$. We test
$$H_0: C = C_0 \quad \text{versus} \quad H_1: C \neq C_0, \tag{11}$$
where $C_0$ is specified. The modified Cramér–von Mises test statistic then becomes
$$T_n^{c} = \frac{1}{n} \sum_{1 \le i \neq i' \le n} \int_{[0,1]^d} \big( \mathbf{1}\{X_i \le u\} - C_0(u) \big) \big( \mathbf{1}\{X_{i'} \le u\} - C_0(u) \big) \, \mathrm{d}C_0(u). \tag{12}$$
Let $g^{c}(x, u) = \mathbf{1}\{x \le u\} - C_0(u)$ for $x, u \in [0,1]^d$. For $u, v \in [0,1]^d$, let the covariance function be $K^{c}(u, v) = \mathrm{Cov}\big( g^{c}(X_1, u), g^{c}(X_1, v) \big)$. The associated linear operator $\mathcal{K}^{c}$ can be given by
$$(\mathcal{K}^{c} f)(u) = \int_{[0,1]^d} K^{c}(u, v) f(v) \, \mathrm{d}C_0(v).$$
Corollary 1 below provides the result for $T_n^{c}$ analogously.
Corollary 1. Assume $H_0$ in (11) is true. Let $\lambda_k^{c}$, $k \ge 1$, be the eigenvalues of the linear operator $\mathcal{K}^{c}$. We define $W^{c} = \sum_{k=1}^{\infty} \lambda_k^{c} (\chi_{1,k}^2 - 1)$, where $\chi_{1,k}^2$ represents independent $\chi_1^2$ random variables. Assume that, for some $\delta > 0$, the analogue of condition (7) holds. Then, the distributional approximation (13) between $T_n^{c}$ and $W^{c}$ holds.

As in the test of marginals, the asymptotic distribution of $T_n^{c}$ also depends on the eigenvalues $\lambda_k^{c}$. As a specific application, in many cases, we are interested in testing for independence. If the hypothesized distribution (for example, a joint normal distribution with a known mean and covariance matrix) admits explicit conditional distributions given preceding components, we can first apply the Rosenblatt transformation [13] to the data and then equivalently test whether the transformed sample is from a population uniformly distributed on the $d$-dimensional hypercube. Thus, the problem can be equivalently transformed to test
$$H_0: \text{the transformed vector is uniformly distributed on } [0,1]^d. \tag{14}$$
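For the joint normal case mentioned above, the Rosenblatt transformation can be computed from the Cholesky factor of the covariance matrix; the sketch below is a minimal illustration, with `rosenblatt_normal` a hypothetical helper name and the AR(1)-type covariance an arbitrary example.

```r
# Rosenblatt transform for a N(mu, Sigma) null with known parameters:
# the j-th coordinate is the conditional probability transform given
# coordinates 1, ..., j-1, computed via the lower Cholesky factor.
rosenblatt_normal <- function(X, mu, Sigma) {
  L <- t(chol(Sigma))                       # lower-triangular factor
  Z <- forwardsolve(L, t(X) - mu)           # standardized conditional residuals
  pnorm(t(Z))                               # n x d matrix in [0, 1]^d
}

set.seed(1)
d <- 5; n <- 100
Sigma <- 0.5 ^ abs(outer(1:d, 1:d, "-"))    # AR(1)-type covariance, for example
X <- matrix(rnorm(n * d), n) %*% chol(Sigma)
U <- rosenblatt_normal(X, mu = rep(0, d), Sigma = Sigma)
# Under the null, the rows of U are i.i.d. uniform on [0, 1]^d.
```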
A sequence of work has been devoted to the testing of independence (14) in classical settings with a fixed $d$. The Cramér–von Mises test has been investigated by Blum et al. [14], Cotterill and Csörgő [15], Cotterill et al. [16], and Genest and Rémillard [17], among others. Our result in Corollary 1 makes a theoretical contribution in the high-dimensional case, in which $d$ can increase with $n$.
We acknowledge that known marginals and copulas in high-dimensional scenarios may not be readily accessible in many practical applications. Our primary objective is to provide a theoretical investigation, and the results presented in this paper serve as a crucial initial advancement toward investigating more realistic settings, for instance, distributions with estimated parameters.
3. Estimating Distributions of Linear Combinations of Chi-Squares
In Section 2, we presented the asymptotic results of Theorem 1 and Corollary 1. If the eigenvalues $\lambda_k$ were known, then we could derive the distribution of $W$ either analytically or numerically via Monte Carlo simulation. With a significance level $\alpha$, such as $0.05$, take the critical value as $q_{1-\alpha}$, the $(1-\alpha)$th quantile of $W$. We reject $H_0$ if the value of $T_n$ exceeds $q_{1-\alpha}$. Similar steps apply for $T_n^{c}$. Practically, the eigenvalues are unknown. We can use (8) or (13) for high-dimensional Cramér–von Mises testing by estimating the distribution of $W$ (resp. $W^{c}$), which is a linear combination of chi-squared distributions, and computing the threshold value, i.e., $q_{1-\alpha}$ (resp. $q_{1-\alpha}^{c}$). A natural calibration involves estimating the eigenvalues and plugging them into the distribution of $W$ (resp. $W^{c}$). For the special case when $d = 1$, we can work out the eigenvalues as $\lambda_k = 1/(k^2 \pi^2)$. In most cases, however, it is highly nontrivial to analytically compute the covariance function and the associated eigenvalues. To this end, we propose procedures to implement the high-dimensional Cramér–von Mises test in practice. We introduce two different approaches. In Section 3.1, we propose a plug-in calibration approach, the validity of which is theoretically justified in Section 3.2. In Section 3.3, we introduce a subsampling approach with its theoretical guarantee. In this context, we estimate the distribution of $W$; the case of $W^{c}$ can be dealt with in a similar way.
Table 1 provides a summary of the notations commonly used in this section.
3.1. A Plug-In Procedure
Let $N$ be a sufficiently large integer and $p = Nd$. We first apply the discretization technique to the interval $[0, 1]$ by taking evenly spaced points $u_t = t/N$ for $1 \le t \le N$. Let $g(x, u) = \mathbf{1}\{x \le u\} - u$, and let the $p$-dimensional vector be $Y_i = \big( g(X_{ij}, u_t) \big)_{1 \le j \le d,\, 1 \le t \le N}$, where we recall that $X_i = (X_{i1}, \ldots, X_{id})^\top$ for $1 \le i \le n$. The covariance matrix is denoted as $\Gamma = \mathrm{Cov}(Y_1)$, with eigenvalues $\lambda_1(\Gamma) \ge \cdots \ge \lambda_p(\Gamma)$. Note that $\Gamma/N$ is a discretized version of the linear operator $\mathcal{K}$ given in (6) and that the eigenvalues of $\Gamma/N$ converge to those of $\mathcal{K}$ as $N \to \infty$ (cf. [25]). Recall that $\alpha$ is the significance level and $q_{1-\alpha}$ is the $(1-\alpha)$th quantile of $W$. The plug-in calibration approach consists of the three steps listed below.
Step 1: Find a good estimate of $\Gamma$ based on the data $X_1, \ldots, X_n$ and denote it by $\widehat{\Gamma}$, with eigenvalues $\lambda_1(\widehat{\Gamma}) \ge \cdots \ge \lambda_p(\widehat{\Gamma})$. Let
$$\widehat{W} = \frac{1}{N} \sum_{k=1}^{p} \lambda_k(\widehat{\Gamma}) \, (\chi_{1,k}^2 - 1), \tag{15}$$
where $\chi_{1,k}^2$ represents i.i.d. $\chi_1^2$ random variables that are independent of $X_1, \ldots, X_n$.

Step 2: Given $\widehat{\Gamma}$, obtain the $(1-\alpha)$th quantile of $\widehat{W}$, either analytically or by extensive simulation.

Step 3: Based on this quantile, compute $\widehat{q}_{1-\alpha}$, which serves as an estimate for $q_{1-\alpha}$. Then, use $\widehat{q}_{1-\alpha}$ as an estimate for the critical value of $W$ at a significance level of $\alpha$.
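A compact R sketch of Steps 1–3 follows, under the discretized-indicator construction and sample covariance estimate assumed in this rewrite; the scaling by $N$ and the centering of the chi-squared terms mirror the reconstructed displays and are not taken verbatim from the original.

```r
# Plug-in calibration (Steps 1-3): estimate the eigenvalues of the
# discretized covariance matrix and simulate the plug-in limit law.
# Intended for illustration with small N and d; p = N * d can be very large.
plugin_critical_value <- function(X, N = 100, alpha = 0.05, nrep = 2000) {
  n <- nrow(X); d <- ncol(X)
  u <- (1:N) / N
  # p-dimensional discretized vectors Y_i (assumed construction)
  Y <- matrix(NA_real_, n, N * d)
  for (j in 1:d) Y[, (j - 1) * N + 1:N] <- outer(X[, j], u, "<=") - rep(u, each = n)
  lam <- eigen(cov(Y), symmetric = TRUE, only.values = TRUE)$values / N
  lam <- pmax(lam, 0)
  # Simulate W-hat = sum_k lam_k (chi2_k - 1); take its (1 - alpha)th quantile
  Wrep <- replicate(nrep, sum(lam * (rchisq(length(lam), df = 1) - 1)))
  quantile(Wrep, 1 - alpha)
}
```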
To ensure the validity of the above procedure, we need to impose suitable conditions so that the following requirements are met:
- (i) The estimated quantile $\widehat{q}_{1-\alpha}$ is close to the $(1-\alpha)$th quantile of $W$;
- (ii) The ratio consistency $\widehat{q}_{1-\alpha} / q_{1-\alpha} \to 1$ holds in probability.
As discussed in Section 3.2, (i) requires that $N$ be sufficiently large and that $\widehat{\Gamma}$ be a normalized consistent estimate of $\Gamma$ (see Definition 1). In Section 3.2, we also discuss its relation with the classical spectral norm convergence (26). In this paper, we verify that the sample covariance matrix, i.e.,
$$\widehat{\Gamma} = \frac{1}{n} \sum_{i=1}^{n} (Y_i - \bar{Y})(Y_i - \bar{Y})^\top, \qquad \bar{Y} = \frac{1}{n} \sum_{i=1}^{n} Y_i,$$
can be a good choice of $\widehat{\Gamma}$ in Step 1, and we illustrate the validity of the procedure using it in Theorem 2. The eigenvalues of the sample covariance matrix are denoted by $\lambda_1(\widehat{\Gamma}) \ge \cdots \ge \lambda_p(\widehat{\Gamma})$. We can define $\widehat{W}$ and $\widehat{q}_{1-\alpha}$ accordingly.
Theorem 2. Under $H_0$ in (4) and as $n \to \infty$, it holds that $\widehat{q}_{1-\alpha} / q_{1-\alpha} \to 1$ in probability, and, with probability converging to 1, the quantile approximation holds, where $\mathbb{P}^{*}$ denotes the conditional probability given $X_1, \ldots, X_n$.

Remark 4. To deal with $T_n^{c}$, we discretize the hypercube $[0,1]^d$. Let $N$ be a sufficiently large number; the resulting covariance matrix is a discretized version of the linear operator $\mathcal{K}^{c}$. Using similar arguments as in Theorem 2 and adapting the conditions accordingly, we can verify the validity of the procedure when it is applied to $T_n^{c}$.
Although Theorem 2 provides a theoretical guarantee for the plug-in method, practical implementation remains challenging due to the ultra-high dimension of the discretized sample covariance matrix $\widehat{\Gamma}$. In particular, $\widehat{\Gamma}$ is of size $p \times p$, with $p = Nd$ for the marginal testing problem. Directly estimating such a large number of eigenvalues involves a significant computational workload. To address this, an equivalent approach, the Gaussian multiplier bootstrap, can be employed for computational efficiency. Recall the discretized vectors $Y_i$ and their sample mean $\bar{Y}$ from Section 3.1. Let $B$ be a sufficiently large number and let $\xi_{1b}, \ldots, \xi_{nb}$, $1 \le b \le B$, be i.i.d. $N(0, 1)$ multipliers. Let
$$W_b^{*} = \frac{1}{N} \left( \Big\| \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \xi_{ib} (Y_i - \bar{Y}) \Big\|^2 - \mathrm{tr}(\widehat{\Gamma}) \right), \quad 1 \le b \le B.$$
Then, it follows that, given the data $X_1, \ldots, X_n$, $W_b^{*}$ is distributed identically to the plug-in statistic $\widehat{W}$. Therefore, we can approximate the distribution of $W$ according to the empirical distribution of $W_1^{*}, \ldots, W_B^{*}$; let $\widehat{q}_{1-\alpha}^{*}$ denote its empirical $(1-\alpha)$th quantile. Then, $H_0$ is rejected if $T_n > \widehat{q}_{1-\alpha}^{*}$. Below, in Algorithm 1, we provide the pseudo-code for the implementation of the plug-in calibration approach for the marginal testing problem. It can easily be adapted to test for joint distributions. In the implementation, we recommend taking the values of $N$ and $B$ as large integers, such as 1000.
Algorithm 1: Cut-off value approximation by plug-in calibration
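The pseudo-code image for Algorithm 1 is not reproduced here; the following R sketch outlines one plausible implementation of the multiplier bootstrap calibration consistent with the description above, with the exact centering and scaling being reconstructions rather than the authors' verbatim algorithm.

```r
# Gaussian multiplier bootstrap calibration for the marginal test:
# avoids forming (and eigendecomposing) the p x p matrix Gamma-hat.
# N = 100 keeps the sketch light; the text recommends larger values.
multiplier_bootstrap_cutoff <- function(X, N = 100, B = 1000, alpha = 0.05) {
  n <- nrow(X); d <- ncol(X)
  u <- (1:N) / N
  # Discretized, centered indicator vectors, stored as an n x (N * d) matrix
  Y <- matrix(NA_real_, n, N * d)
  for (j in 1:d) Y[, (j - 1) * N + 1:N] <- outer(X[, j], u, "<=") - rep(u, each = n)
  Yc <- sweep(Y, 2, colMeans(Y))            # center the columns
  trace_hat <- sum(Yc^2) / n                # tr(Gamma-hat), without forming it
  Wstar <- replicate(B, {
    xi <- rnorm(n)                          # Gaussian multipliers
    s <- crossprod(Yc, xi) / sqrt(n)        # ~ N(0, Gamma-hat) given the data
    (sum(s^2) - trace_hat) / N              # centered, scaled bootstrap draw
  })
  quantile(Wstar, 1 - alpha)                # empirical (1 - alpha)th quantile
}
```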
3.2. Validity of the Plug-In Procedure
The idea of plug-in calibration is to approximate the distribution of $W$ according to that of $\widehat{W}$ defined in (15). To evaluate the validity of the distributional approximation, we first state a useful lemma concerning the closeness of two general linear combinations of chi-squared distributions.
Lemma 1. For $k \ge 1$, let $(a_k)$ and $(b_k)$ be two sequences of real numbers satisfying $a_1 \ge a_2 \ge \cdots \ge 0$ and $b_1 \ge b_2 \ge \cdots \ge 0$, and assume that both sequences are summable. Let $\chi_{1,k}^2$, $k \ge 1$, be i.i.d. $\chi_1^2$ random variables, and let $V_a = \sum_{k \ge 1} a_k (\chi_{1,k}^2 - 1)$ and $V_b = \sum_{k \ge 1} b_k (\chi_{1,k}^2 - 1)$. Then, there exists some constant $C > 0$ such that the Kolmogorov distance between the distributions of $V_a$ and $V_b$ is bounded by $C$ times a discrepancy measure between $(a_k)$ and $(b_k)$.

Similar to (15), we define
$$V = \sum_{k=1}^{p} \frac{\lambda_k(\Gamma)}{N} (\chi_{1,k}^2 - 1) \quad \text{and} \quad \widehat{V} = \sum_{k=1}^{p} \frac{\lambda_k(\widehat{\Gamma})}{N} (\chi_{1,k}^2 - 1),$$
where $\lambda_k(\Gamma)$ and $\lambda_k(\widehat{\Gamma})$ are the eigenvalues of $\Gamma$ and $\widehat{\Gamma}$, as defined in Section 3.1.
According to Lemma 1, plugging in $\lambda_k(\Gamma)/N$ and $\lambda_k(\widehat{\Gamma})/N$ for $(a_k)$ and $(b_k)$, respectively, we can bound the Kolmogorov distance between the distributions of $V$ and $\widehat{V}$. Note that the selection of $N$ is arbitrary and is independent of $d$: the eigenvalues of $\Gamma/N$ converge to those of $\mathcal{K}$ for any $d$ (cf. [25]), and, similarly, the quantiles of $V$ converge to those of $W$ as $N \to \infty$. By choosing a sufficiently large $N$, the two requirements ((i) and (ii)) described in Section 3.1 can be reduced to the following:
- (i') The estimated quantile $\widehat{q}_{1-\alpha}$ is close to the $(1-\alpha)$th quantile of $V$;
- (ii') The ratio consistency of $\widehat{q}_{1-\alpha}$ relative to the quantile of $V$ holds in probability.
For theoretical justification, we adopt the following more general setting in this subsection. Let $Z_1, \ldots, Z_n$ be independent and identically distributed $p$-dimensional random vectors with a mean of $\mu$ and a covariance matrix of $\Sigma$, with eigenvalues $\lambda_1(\Sigma) \ge \cdots \ge \lambda_p(\Sigma)$. Let $\widehat{\Sigma}$ be an estimate of $\Sigma$ with eigenvalues $\lambda_1(\widehat{\Sigma}) \ge \cdots \ge \lambda_p(\widehat{\Sigma})$. We employ the plug-in procedure described in Section 3.1 to approximate the distribution of $V$. According to Lemma 1, if the eigenvalue closeness condition (20) holds, then, with probability converging to 1, the quantile approximation (21) holds, where $\mathbb{P}^{*}$ denotes the conditional probability given $Z_1, \ldots, Z_n$. With (21), requirement (i') is met.
Next, we verify the condition (20). Interestingly, there is a simple sufficient condition for (20). According to Weyl's theorem (cf. Theorem 8.1.5 in Golub and Van Loan [28]), (20) follows from
$$\frac{\| \widehat{\Sigma} - \Sigma \|}{\mathrm{tr}(\Sigma)} \to 0 \quad \text{in probability}. \tag{22}$$
We say an estimator of a matrix is normalized consistent if (22) holds, as stated in the definition below.
Definition 1. Let $\widehat{\Sigma}$ be an estimator of $\Sigma$. We say $\widehat{\Sigma}$ is normalized consistent for $\Sigma$ if (22) holds.

Theorem 3 below indicates that, under mild conditions, the sample covariance matrix
$$\widehat{\Sigma} = \frac{1}{n} \sum_{i=1}^{n} (Z_i - \bar{Z})(Z_i - \bar{Z})^\top, \qquad \bar{Z} = \frac{1}{n} \sum_{i=1}^{n} Z_i,$$
satisfies normalized consistency (22), which is a sufficient condition for $\widehat{\Sigma}$ to meet requirement (i'). Moreover, Theorem 3 shows that $\mathrm{tr}(\widehat{\Sigma}) / \mathrm{tr}(\Sigma) \to 1$ in probability and, thus, $\widehat{\Sigma}$ meets requirement (ii'). The theoretical justification of the validity of the plug-in procedure using the sample covariance matrix is then complete.
Theorem 3. Assume the moment-rate condition (24). Then, $\mathrm{tr}(\widehat{\Sigma}) / \mathrm{tr}(\Sigma) \to 1$ in probability, and the normalized error bound (25) holds, which further implies the normalized consistency (22) of $\widehat{\Sigma}$.

Remark 5. Inspection of the proof shows that Theorem 3 also holds for the sample covariance variant $n^{-1} \sum_{i=1}^{n} (Z_i - \mu)(Z_i - \mu)^\top$ when $\mu$ is known. Theorem 3 requires condition (24). This condition trivially holds if the entries of $Z_1$ are strongly dependent, in which case (24) reduces to a natural growth condition relating $p$ and $n$.
Next, we examine the condition (24) for the marginal testing problem, where the role of $Z_i$ is played by the discretized vector $Y_i$ and $p = Nd$. In the extreme case of complete dependency, where all the margins coincide, condition (24) holds under a minimal growth condition. If the margins are mutually independent, the condition fails when $d$ is much larger than $n$. Below, we provide an example to illustrate the sharpness of this condition in general.
Example 1. Let $Z_1, \ldots, Z_n$ be i.i.d. $p$-dimensional random vectors whose covariance matrix $\Sigma$ has one dominant eigenvalue, with the remaining eigenvalues equal. It can be computed that condition (24) holds if and only if the dominant eigenvalue is sufficiently large relative to the rest; Theorem 3 is then applicable.
On the other hand, we can show that, outside this regime, the sample covariance $\widehat{\Sigma}$ does not satisfy (22) and, thus, is not normalized consistent. In particular, we consider the eigendecomposition $\Sigma = Q D Q^\top$, where $D$ is a diagonal matrix whose entries are the eigenvalues of $\Sigma$ and $Q$ is the orthogonal matrix whose columns are the eigenvectors of $\Sigma$. Working in the eigenbasis, the transformed observations are independent with diagonal covariance $D$, and elementary calculations of the first two moments of the relevant quadratic forms show that the normalized error does not vanish in probability, so (22) fails in this regime. □

The concept of normalized consistency in (22) is closely related to, but different from, the classical definition of spectral norm consistency in the sense of
$$\| \widehat{\Sigma} - \Sigma \| \to 0 \quad \text{in probability}. \tag{26}$$
Normalized consistency does not generally imply spectral norm consistency (26). For example, let $p = n$ and $Z_1, \ldots, Z_n$ be i.i.d. standard normal random vectors, so that $\Sigma = I_p$. According to random matrix theory, (26) does not hold for the sample covariance matrix $\widehat{\Sigma}$, which is not a consistent estimate of $\Sigma$ (see Marčenko and Pastur [29], Wachter [30], Geman [31]). Indeed, the largest eigenvalue of $\widehat{\Sigma}$ converges to 4, while the smallest one converges to 0. However, the normalized consistency (22) holds, since $\| \widehat{\Sigma} - \Sigma \| = O_{\mathbb{P}}(1)$, while $\mathrm{tr}(\Sigma) = p \to \infty$. Without further conditions, spectral norm consistency (26) does not imply normalized consistency either. Proposition 1 relates these two types of convergence, which is of independent interest; a short numerical illustration of the example above follows.
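The following short R experiment, assuming the trace-normalized form of (22) used in this rewrite, reproduces the phenomenon numerically.

```r
# Empirical check of the example above: for p = n i.i.d. standard normal
# vectors, the sample covariance is spectrally inconsistent, yet its
# trace-normalized error is small (normalized consistency, as assumed here).
set.seed(1)
n <- p <- 500
Z <- matrix(rnorm(n * p), n, p)
S <- crossprod(Z) / n                   # sample covariance (mean known to be 0)
lam <- eigen(S, symmetric = TRUE, only.values = TRUE)$values
range(lam)                              # approximately (0, 4), per Marchenko-Pastur
max(abs(lam - 1))                       # spectral error ||S - I|| stays O(1)
max(abs(lam - 1)) / p                   # trace-normalized error is tiny
```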
Proposition 1. For an estimate $\widehat{\Gamma}$ of $\Gamma$, under a suitable condition on $\mathrm{tr}(\Gamma)$ holding in probability, the normalized consistency (22) holds if and only if the corresponding spectral norm convergence holds.

3.3. A Subsampling Procedure
Plug-in calibration requires a discretization step and imposes condition (24) in addition to the conditions of Theorem 1. In this section, we provide a subsampling approach based on an idea different from that of the plug-in approach, which avoids the discretization step and covariance matrix estimation. For a subset $S \subset \{1, \ldots, n\}$, let $|S|$ be its cardinality. The marginal empirical distribution functions for the subsample and the entire sample are denoted as follows:
$$F_{S,j}(u) = \frac{1}{|S|} \sum_{i \in S} \mathbf{1}\{X_{ij} \le u\}, \qquad F_{n,j}(u) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{X_{ij} \le u\}, \qquad u \in [0, 1].$$
Associated with the subset $S$, the Cramér–von Mises-type subsampling statistic for the testing of marginal distributions is written as in (28), with the full-sample empirical distribution functions playing the role of the reference distribution. We consider a variant of (28) that excludes the diagonal elements; the additional centering quantities in this variant are introduced to remove the bias. We then introduce the empirical cumulative distribution functions (30) and (31), which provide a subsampling-based estimate of the distribution function of $W$.
As a slightly different version, one can also employ the following procedure. Let the index sets be $S_1, \ldots, S_K$, drawn as random subsets of $\{1, \ldots, n\}$ of size $m$. The subsampling empirical distribution function is then defined as in (31). Based on the result of (8), we can show that the subsampling approach is asymptotically valid (see Theorem 4 below for details, which shows that the empirical distribution functions defined in (30) and (31) both converge to the distribution function of $W$). The subsample proportion $m/n$ is a tuning parameter. Theoretically, we only require $m \to \infty$ and $m/n \to 0$.
Theorem 4. Let $m \to \infty$ and $m/n \to 0$. Assume that (7) holds, with $n$ therein replaced by $m$. Then, under $H_0$ in (4), (i) the convergence (32) holds for the subsampling empirical distribution function in (30); (ii) if the number of subsamples grows, then the convergence (32) also holds for the variant in (31).

The subsampling method avoids the discretization step. In addition, Theorem 4 imposes no additional condition beyond those of Theorem 1. It only requires condition (7), with $n$ replaced by $m$, where the subsample size $m$ can be any value such that $m \to \infty$ and $m/n \to 0$. Comparatively, plug-in calibration requires the additional condition (24) to guarantee its consistency, which may fail in high-dimensional settings (see Remark 5). On the other hand, a scrutinization of the proof also reveals that the convergence rate of the subsampling approximation is upper-bounded by the rate in (8) with $n$ replaced by $m$, which hampers the convergence rate if $n$ is small.
For ease of implementation, we provide the pseudo-code for the subsampling approach in Algorithm 2. A recommended choice takes the subsample size $m$ to be of a smaller order than $n$ and the subsampling number to be a large integer, such as 1000. A flowchart of the overall procedure for hypothesis testing is presented in Figure 1.
Algorithm 2: Cut-off value approximation by subsampling
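The pseudo-code image for Algorithm 2 is likewise not reproduced; the sketch below shows a simplified subsampling calibration that recomputes the diagonal-removed statistic (the `T_n` function from the earlier sketch) on random subsamples and centers at the known null c.d.f. rather than at the full-sample empirical c.d.f. used for bias removal in the text.

```r
# Subsampling calibration: approximate the null distribution of T_n by the
# empirical distribution of the statistic recomputed on random subsamples
# of size m << n (reusing T_n() from the earlier sketch).
subsampling_pvalue <- function(X, m = floor(nrow(X)^0.7), K = 1000) {
  n <- nrow(X)
  stat <- T_n(X)                                   # statistic on the full sample
  stat_sub <- replicate(K, T_n(X[sample(n, m), , drop = FALSE]))
  mean(stat_sub >= stat)                           # subsampling p value
}
```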
Remark 6. The subsampling approach can also be applied to the testing of joint distributions based on $T_n^{c}$; the subsampling test statistic can be corrected analogously.

4. Numerical Analysis
4.1. A Simulation Study
In this section, a simulation study is conducted to evaluate the performance of the test with plug-in and subsampling calibration. For $1 \le i \le n$ and $1 \le j \le d$, data are generated by
$$Z_{ij} = \sqrt{a}\, \epsilon_{i0} + \sqrt{1 - a}\, \epsilon_{ij},$$
where $\epsilon_{i0}, \epsilon_{i1}, \ldots, \epsilon_{id}$ represent i.i.d. standard normal random variables. The vectors $Z_i = (Z_{i1}, \ldots, Z_{id})^\top$ are i.i.d. $N(0, \Sigma^{(a)})$ random vectors, where $\Sigma^{(a)}$ has diagonal entries of 1 and off-diagonal entries of $a$, i.e.,
$$\Sigma^{(a)} = (1 - a) I_d + a \mathbf{1}_d \mathbf{1}_d^\top.$$
The value of $a$ controls the strength of the cross-sectional dependence among the components. We consider testing the null hypothesis in the marginal test.
Let $\Phi$ be the c.d.f. of the standard normal random variable. To test the marginal distributions, we first obtain $X_{ij} = \Phi(Z_{ij})$; the test is then equivalent to (4). The modified Cramér–von Mises statistics (5) are computed, and plug-in and subsampling calibration are performed following Algorithms 1 and 2, respectively, with large values of $N$ and $B$ and a subsample size $m$ of a smaller order than $n$. In the simulation study, we consider a range of sample sizes $n$, dimensions $d$, and dependence levels $a$. The $p$ values under the null hypothesis are reported from each of the 1000 repetitions.
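The data-generating mechanism can be sketched in R as follows, assuming the single-factor equicorrelation construction reconstructed above; the design values passed to `simulate_data` are placeholders rather than the settings reported in the original.

```r
# Generate one simulated dataset: Z_i ~ N(0, Sigma_a) with equicorrelation a,
# mapped to uniform margins via the standard normal c.d.f.
simulate_data <- function(n, d, a) {
  common <- rnorm(n)                               # shared factor per observation
  idio <- matrix(rnorm(n * d), n, d)               # idiosyncratic noise
  Z <- sqrt(a) * common + sqrt(1 - a) * idio       # equicorrelated normals
  pnorm(Z)                                         # uniform margins under H0
}
X <- simulate_data(n = 100, d = 500, a = 0.2)      # placeholder design values
```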
In Figure 2, Figure 3 and Figure 4, we present the empirical distribution of the $p$ values under $H_0$ for the various combinations of $n$, $d$, and $a$. Under $H_0$, the $p$ values obtained according to the true sampling distribution of the test statistic have a uniform distribution between 0 and 1. Therefore, the $p$ values estimated from an accurate approximation of the sampling distribution should have an empirical cumulative distribution function close to the 45-degree line. It is observed from Figure 2, Figure 3 and Figure 4 that, under $H_0$, the $p$ values of our Cramér–von Mises-type test obtained via subsampling calibration are uniformly distributed between 0 and 1 for all scenarios considered, which verifies Theorem 4. The plug-in procedure, on the other hand, results in uniform $p$ values in classical multivariate cases with a moderately low dimension, while in high-dimensional scenarios, the $p$ values are less dispersed in the cross-sectionally independent ($a = 0$) and weakly dependent (small $a$) cases, so the corresponding procedure tends to result in a true size smaller than the nominal size at small levels such as 0.05. This observation indicates that subsampling calibration outperforms plug-in calibration in high-dimensional settings with moderately large $n$ values, which aligns with Remark 5. The computational loads of the plug-in and subsampling procedures are similar, as the plug-in method is implemented with the Gaussian multiplier bootstrap, which avoids estimation of the ultra-high-dimensional covariance matrix and its eigenvalues (cf. Algorithm 1).
To assess the power, the data vectors are modified so that the marginal distributions of the first $\lfloor sd \rfloor$ margins follow a $t$ distribution, with the proportion $s$ evaluated at several values in the simulation, while the copula remains unchanged. Four methods, referred to as plug-in, subsampling, BH, and ZW, are compared. They are based on the modified Cramér–von Mises statistics (5) with plug-in and subsampling calibration, the Benjamini–Hochberg procedure (cf. [32]), and the Kolmogorov–Smirnov-type statistic presented in [11], respectively. The corresponding nominal significance levels and false discovery rate are set to 0.05. In Table 2, we report the proportions of rejection, i.e., the sizes and powers under the null and alternative hypotheses, respectively, obtained from 500 repetitions.
The size and power scenarios are from the same simulation repetitions. The plug-in procedure exhibits lower sizes than nominal, especially in the high-dimensional, weakly dependent settings, which aligns with the empirical CDF plots. When the changes are moderately sparse and dense, both calibrations of the modified Cramér–von Mises method outperform both the BH and ZW procedures in almost all the dimension and dependence combinations.
4.2. An Illustrative Example
Normality assumptions are usually imposed, for example, to construct forecast intervals in vector autoregressive (VAR) models. Even when this assumption is violated, central limit theorem-type results can, under certain conditions, still support the asymptotic normality of the estimation error of the conditional expectation. However, a depiction of the distribution of the white-noise component is required to construct the forecast intervals.
In this section, we illustrate the application of our proposed test statistic $T_n$ to the residuals from a high-dimensional VAR model studied in [33]. A VAR model was used to analyze a macroeconomic dataset compiled by Stock and Watson [34] and enlarged by Koop [35]. The full dataset contains 168 quarterly macroeconomic indicators between Quarter 2, 1959 and Quarter 4, 2007, which are available in The Journal of Applied Econometrics Data Archive.

In this application, we test the residuals from the medium-size VAR model following [33], where the VAR model estimated with the Lasso was considered. The time series includes 20 quarterly macroeconomic series, such as real gross domestic product (GDP251), the Consumer Price Index (CPIAUCSL), and the Federal Funds Rate (FYFF). The series is assumed to follow a VAR model driven by a white noise process. In [33], the 195 quarterly time points were partitioned such that the first 69 quarters' data were used to train the Lasso algorithm and estimate the coefficient matrix $A$, the next 75 were used for cross-validation, and the last 61 time points were used for evaluation of the performance. To test the marginal distribution of the residuals, we input the residuals extracted from the test set using the R functions provided by Nicholson et al. [33], with $n = 61$ and $d = 20$. The residuals are obtained with the BigVAR package and the data described in [33], which are available at https://cran.r-project.org/web/packages/BigVAR/index.html (accessed on 1 November 2024) and https://github.com/wbnicholson/BigVAR (accessed on 1 November 2024), respectively.
We remark that testing the normality of errors in high-dimensional VAR models is, itself, a challenging problem. Here, we only consider marginal tests of normality, which is a necessary but not sufficient condition for the joint Gaussian assumption. The mean vector of the prediction residuals is asymptotically 0 in the Euclidean norm under certain conditions (e.g., [36]). As an illustration of our test, here, we simply rescale each residual series with its sample standard deviation and test whether all the marginal distributions are standard normal. A more careful approximation and rigorous analysis of the residuals, as well as the test for joint normality, should be considered in future work.
The value of $T_n$ and the $p$ values estimated from the plug-in and subsampling methods both result in the rejection of the Gaussian marginal null hypothesis at a significance level of 0.05. Comparatively, classical multiple comparisons using individual Cramér–von Mises tests fail to reject the null hypothesis. Indeed, even the five smallest individual $p$ values lie above Bonferroni's cut-off value of $0.05/20 = 0.0025$ for a familywise error rate of 0.05, which suggests not rejecting any individual null hypothesis. Additionally, the Benjamini–Hochberg method [32] with a 0.05 false discovery rate also results in the rejection of none of the null hypotheses. Multiple comparisons are known to be conservative when cross-sectional dependence exists. A scrutinization of the QQ-normal plot (Figure 5) reveals that there are residual series that deviate remarkably from normality.
The smallest univariate Cramér–von Mises $p$ value is from the residuals of the FMRRA series (Depository Institution Reserves: total, adjusted for reserve requirement changes), which apparently has heavier tails than the Gaussian distribution. This $p$ value is also close to the Bonferroni correction cut-off value of 0.0025. We therefore consider fitting a $t$ distribution: using maximum likelihood estimation in R, we fit the FMRRA residuals with a $t$ marginal distribution with an estimated location, scale, and degrees of freedom.
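The original text does not name the routine used; one standard way to obtain such maximum likelihood estimates in R is `MASS::fitdistr`, sketched below with simulated stand-in data for the FMRRA residuals.

```r
# Maximum likelihood fit of a location-scale t distribution to one residual
# series (resid_fmrra is a hypothetical stand-in for the FMRRA residuals).
library(MASS)
set.seed(1)
resid_fmrra <- rt(61, df = 3)                  # placeholder data, n = 61
fit <- fitdistr(resid_fmrra, densfun = "t",
                start = list(m = median(resid_fmrra), s = mad(resid_fmrra), df = 5))
fit$estimate                                   # location m, scale s, and df
```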
Now, we test the null hypothesis that the FMRRA margin follows the fitted $t$ distribution while the remaining margins are standard normal. Applying the test (5), the estimated $p$ values from the plug-in and subsampling methods both exceed 0.05; thus, this null hypothesis is not rejected at the significance level of 0.05. Our test suggests that it is more appropriate to use these distributions, rather than all Gaussian margins, in further inference.