Article

Testing Equality of Several Distributions at High Dimensions: A Maximum-Mean-Discrepancy-Based Approach

1 Department of Information Systems and Analytics, National University of Singapore, Singapore 117417, Singapore
2 School of Statistics and Mathematics, Guangdong University of Finance and Economics, Guangzhou 510320, China
3 National Institute of Education, Nanyang Technological University, Singapore 637616, Singapore
4 Department of Statistics and Data Science, National University of Singapore, Singapore 117546, Singapore
* Author to whom correspondence should be addressed.
Mathematics 2023, 11(20), 4374; https://doi.org/10.3390/math11204374
Submission received: 25 September 2023 / Revised: 18 October 2023 / Accepted: 18 October 2023 / Published: 21 October 2023
(This article belongs to the Special Issue Advances of Functional and High-Dimensional Data Analysis)

Abstract
With the development of modern data collection techniques, researchers often encounter high-dimensional data across various research fields. An important problem is to determine whether several groups of such high-dimensional data originate from the same population. To address this problem, this paper presents a novel $k$-sample test of equal distributions for high-dimensional data, utilizing the Maximum Mean Discrepancy (MMD). The test statistic is constructed using a V-statistic-based estimator of the squared MMD derived for several samples. The asymptotic null and alternative distributions of the test statistic are derived. To approximate the null distribution accurately, three simple methods are described. To evaluate the performance of the proposed test, two simulation studies and a real data example are presented, demonstrating the effectiveness and reliability of the test in practical applications.

1. Introduction

Testing whether multiple samples follow the same distribution is a fundamental challenge in data analysis, with wide-ranging applications across diverse fields. Traditional nonparametric tests designed for comparing the distributions of two samples, such as the Wald–Wolfowitz runs test, the rank-based Mann–Whitney–Wilcoxon test, and the Kolmogorov–Smirnov test utilizing the Empirical Distribution Function (EDF), are well established for univariate data [1]. Extending these tests to the multivariate setting in $\mathbb{R}^p$ has been the focus of extensive research, leading to the development of novel tests based on multivariate runs, ranks, EDF, distances, and projections, as pioneered by [2,3,4,5], among others.
In this paper, our primary focus is on addressing the multi-sample problem for equal distributions in high-dimensional data. In contemporary data analysis, high-dimensional datasets have become increasingly prevalent and easily accessible across various domains. For instance, Section 5 introduces a dataset derived from a keratoconus study involving corneal surfaces. This collaborative effort involves Ms. Nancy Tripoli and Dr. Kenneth L. Cohen from the Department of Ophthalmology at the University of North Carolina, Chapel Hill. The dataset comprises 150 corneal surfaces, each characterized by over 6000 height measurements. These surfaces are categorized into four distinct groups based on their corneal shapes, prompting the question of whether these groups share a common underlying distribution. Consequently, there is a pressing need to develop tests that can assess distributional equality in the context of high-dimensional data.
Mathematically, a multi-sample problem for high-dimensional data can be described as follows. Assume that we have the following $k$ samples of observed random elements in $\mathbb{R}^p$:
$$y_{\alpha 1}, \ldots, y_{\alpha n_\alpha} \overset{\text{i.i.d.}}{\sim} F_\alpha, \quad \alpha = 1, \ldots, k, \tag{1}$$
where the data dimension $p$ can be very large and $F_\alpha, \alpha = 1, \ldots, k$, are unknown cumulative distribution functions. Of interest is to test whether the $k$ distribution functions are the same:
$$H_0: F_1 = \cdots = F_k, \quad \text{vs.} \quad H_1: F_\alpha \neq F_\beta \ \text{for some} \ \alpha \neq \beta. \tag{2}$$
When k = 2 , a variety of distance-based tests designed for multivariate data, such as those proposed by [2,3,4,5], can potentially be applied to test (2) in the context of high-dimensional data. However, it has been demonstrated by [6] that these tests may lack the power to effectively detect differences in distribution scales in high dimensions. In response to this challenge, ref. [6] introduced a test based on interpoint distances, while [6] proposed a high-dimensional two-sample run test based on the shortest Hamiltonian path. Nevertheless, ref. [7] showed that these tests, along with a graph-based test by [8], are less effective in detecting differences in location. To address this limitation, ref. [7] introduced two asymptotic normal tests for both location and scale differences, based on interpoint distances. However, it is worth noting that these tests rely on a strong mixing condition and impose a natural ordering on the components of high-dimensional observations, limiting their applicability. Furthermore, these tests involve complex U-statistic estimators, making them computationally intensive. Several other approaches have also been proposed for high-dimensional distribution testing. Ref. [9] presented a high-dimensional permutation test using a symmetric measure of distance between data points, but it comes with high computational costs. Refs. [10,11] proposed tests based on projections, which are generally more effective in detecting low-dimensional distributional differences. Ref. [12] introduced a test based on energy distance and permutation, but it is less powerful in detecting scale differences. Refs. [13,14] proposed kernel two-sample tests based on the Maximum Mean Discrepancy (MMD). Ref. [14] demonstrated the equivalence between the energy-distance-based test and the kernel-based test, showing that the energy-distance-based test can be viewed as a kernel test utilizing a kernel induced by the interpoint distance. The MMD leverages the kernel trick to define a distance between the embeddings of distributions in Reproducing Kernel Hilbert Spaces (RKHSs). It is well suited for checking distribution differences among several high-dimensional samples and is applicable to various data types, such as vectors, strings, or graphs. Recently, there has been further investigation into unbiased and biased MMD-based two-sample tests for high-dimensional data by [15,16], respectively.
On the other hand, when k > 2 , there are limited tests available for testing (2). One such test is the energy test developed by [12]. This test statistic is obtained by directly summing all pairwise energy distances, with its null distribution approximated through permutation. However, this test has some drawbacks, including being time-consuming and yielding p-values that may vary when applied multiple times to the same dataset, as supported by the evidence from Figure 3 in Section 5. Another approach is presented by [17], who extended the idea of MMD from the two-sample problem to the multi-sample problem for equal distributions. They constructed an MMD-based test statistic capable of detecting deviations from distribution homogeneity in several samples. However, the resulting test statistic and its null limit distribution are very complicated in form, which restricts its practical use. To address this limitation, ref. [18] developed a new MMD-based test for testing (2). This test statistic is constructed using U-statistics, making it easy to conduct and yielding accurate results.
In this paper, we maintain our focus on the MMD-based approach for testing (2). However, unlike [18], where a U-statistic technique is employed to construct the test statistic, we take a distinct path by constructing an $L^2$-norm-based test statistic. To achieve this, we first employ a canonical feature map to transform the $k$ original samples (1) into $k$ induced samples (3). Concurrently, we transform the $k$-sample equal-distribution testing problem (2) into a mean vector testing problem (4). This transformation facilitates the straightforward construction of an $L^2$-norm-based test statistic (5) for assessing the mean vector testing problem (4). Leveraging a kernel trick, we derive a formula for computing the $L^2$-norm-based test statistic using the $k$ original samples (1).
Additionally, this paper makes several other significant contributions. Firstly, akin to the work of [17,18], we extend the concept of MMD from two-sample problems to the domain of multi-sample problems with equal distributions. Secondly, we derive the asymptotic null and alternative distributions of the proposed test statistic. Thirdly, we offer three distinct approaches for approximating the null distribution of the MMD-based test statistic, utilizing parametric bootstrap, random permutation, and the Welch–Satterthwaite (W–S) χ 2 -approximation methods. Lastly, we examine two specific scenarios in our comprehensive simulation studies. In the first scenario, the samples have the same mean vector but different covariance matrices, while in the second scenario, the samples exhibit both distinct mean vectors and covariance matrices. Our simulation results demonstrate that the tests we propose effectively maintain precise control over size in both scenarios. However, in terms of empirical power, they outperform (underperform) the energy test introduced by [12] in the first (second) scenario. In other words, when the primary difference in the distributions of the samples lies in their covariance matrices, the new tests are the preferred choice in terms of statistical power.
The remainder of this paper is organized as follows: Section 2 presents the main results and Section 3 introduces three methods to implement our test. In Section 4, we provide two simulation studies. Section 5 showcases an application to the corneal surface data mentioned earlier. Concluding remarks can be found in Section 6. Technical proofs of the main results are included in Appendix A.

2. Main Results

2.1. MMD for Several Distributions

In this section, we show how the MMD can be defined for several distributions. Let $\mathcal{H}$ be an RKHS associated with a characteristic reproducing kernel $K(\cdot, \cdot)$. For any $u$ and $v$ in $\mathcal{H}$, the inner product and $L^2$-norm of $\mathcal{H}$ are defined as $\langle u, v \rangle$ and $\|u\| = \sqrt{\langle u, u \rangle}$, respectively. Let $\phi(\cdot)$ be the canonical feature mapping associated with $K(\cdot, \cdot)$, i.e., $\phi(y) = K(\cdot, y)$. Using this feature mapping and setting $x_{\alpha i} = \phi(y_{\alpha i})$ for $i = 1, \ldots, n_\alpha$, we obtain the following $k$ induced samples in the RKHS $\mathcal{H}$:
$$x_{\alpha 1}, \ldots, x_{\alpha n_\alpha}, \quad \alpha = 1, \ldots, k, \tag{3}$$
which are derived from the $k$ original samples in (1). Define $\mu_\alpha = \mathbb{E}(x_{\alpha 1})$ for $\alpha = 1, \ldots, k$, representing the mean embeddings of the $k$ distributions $F_\alpha$, $\alpha = 1, \ldots, k$.
Ref. [13] established the MMD for two distributions in a separable metric space (see, e.g., Theorem 5 of [13]). Here, we extend it naturally to multi-sample distributions in $\mathbb{R}^p$. According to the MMD of [13], for any $\alpha \neq \beta$, where $\alpha, \beta = 1, \ldots, k$, "testing $H_0: F_\alpha = F_\beta$ vs. $H_1: F_\alpha \neq F_\beta$" based on the two samples $y_{\alpha i}, i = 1, \ldots, n_\alpha$ and $y_{\beta i}, i = 1, \ldots, n_\beta$ is equivalent to "testing $H_0: \mu_\alpha = \mu_\beta$ vs. $H_1: \mu_\alpha \neq \mu_\beta$" based on the two samples $x_{\alpha i}, i = 1, \ldots, n_\alpha$ and $x_{\beta i}, i = 1, \ldots, n_\beta$. Therefore, testing (2) using the $k$ original samples in (1) is equivalent to testing the following hypothesis using the $k$ induced samples in (3):
$$H_0: \mu_1 = \cdots = \mu_k, \quad \text{vs.} \quad H_1: \mu_\alpha \neq \mu_\beta \ \text{for some} \ \alpha \neq \beta. \tag{4}$$
To test (4), following [19], a natural $L^2$-norm-based test statistic using (3) is given by
$$T_n = \sum_{\alpha=1}^{k} n_\alpha \|\bar{x}_\alpha - \bar{x}\|^2, \tag{5}$$
where $\bar{x}_\alpha = n_\alpha^{-1}\sum_{i=1}^{n_\alpha} x_{\alpha i}$ and $\bar{x} = n^{-1}\sum_{\alpha=1}^{k}\sum_{i=1}^{n_\alpha} x_{\alpha i}$, with $n = \sum_{\alpha=1}^{k} n_\alpha$ the total sample size, denote the group and grand sample means, respectively. Through some simple algebra, as given in Appendix A, we can express $T_n$ as
$$T_n = \sum_{\alpha=1}^{k}\frac{n_\alpha(n - n_\alpha)}{n}\|\bar{x}_\alpha\|^2 - 2\sum_{1 \le \alpha < \beta \le k}\frac{n_\alpha n_\beta}{n}\langle \bar{x}_\alpha, \bar{x}_\beta \rangle = \sum_{1 \le \alpha < \beta \le k}\frac{n_\alpha n_\beta}{n}\|\bar{x}_\alpha - \bar{x}_\beta\|^2. \tag{6}$$
Let $\boldsymbol{\pi} = (\pi_1, \ldots, \pi_k)'$ denote the weights of the $k$ distributions $F_1, \ldots, F_k$ such that $\pi_1, \ldots, \pi_k \in (0, 1)$ and $\sum_{\alpha=1}^{k}\pi_\alpha = 1$. Then, when we estimate $\boldsymbol{\pi}$ using $(n_1, \ldots, n_k)'/n$, the test statistic $T_n/n$ estimates the following quantity:
$$\mathrm{MMD}^2(F_1, \ldots, F_k \mid \boldsymbol{\pi}) = \sum_{1 \le \alpha < \beta \le k}\pi_\alpha\pi_\beta\|\mu_\alpha - \mu_\beta\|^2,$$
which can be naturally defined as the MMD of the k distributions F 1 , , F k with the weight vector π . It is worth noting that the MMD for multiple distributions presented above is equivalent to the one derived by [18], offering a much simpler alternative compared to the formulation proposed by [17].
It is easy to justify that $\mathrm{MMD}^2(F_1, \ldots, F_k \mid \boldsymbol{\pi})$ is indeed an MMD of the $k$ distributions $F_1, \ldots, F_k$ in the sense that $\mathrm{MMD}^2(F_1, \ldots, F_k \mid \boldsymbol{\pi}) = 0$ if and only if $F_1 = \cdots = F_k$. On the one hand, when $\mathrm{MMD}^2(F_1, \ldots, F_k \mid \boldsymbol{\pi}) = 0$, we have $\mu_1 = \cdots = \mu_k$, so that for any $1 \le \alpha \neq \beta \le k$, we have $\mu_\alpha = \mu_\beta$; since the kernel is characteristic, this implies that for any $1 \le \alpha \neq \beta \le k$, we have $F_\alpha = F_\beta$ and hence $F_1 = \cdots = F_k$. On the other hand, when $F_1 = \cdots = F_k$, for any $1 \le \alpha \neq \beta \le k$, we have $F_\alpha = F_\beta$, so that for any $1 \le \alpha \neq \beta \le k$, we have $\mu_\alpha = \mu_\beta$ and hence $\mu_1 = \cdots = \mu_k$. Therefore, we can use $T_n$ to test the equality of the $k$ distributions based on the $k$ induced samples (3).

2.2. Computation of the Test Statistic

Notice that the $k$ induced samples (3) are not directly computable, as the canonical feature mapping $\phi(y)$ is defined only implicitly through the reproducing kernel. Fortunately, the reproducing kernel $K(\cdot, \cdot)$ and its canonical feature mapping $\phi(\cdot)$ satisfy the following useful kernel trick: $K(y, y') = \langle \phi(y), \phi(y') \rangle$. Using this, we can express the inner product as follows:
$$\langle x_{\alpha i}, x_{\beta j} \rangle = \langle \phi(y_{\alpha i}), \phi(y_{\beta j}) \rangle = K(y_{\alpha i}, y_{\beta j}). \tag{7}$$
Let $V_{\alpha\alpha} = \|\bar{x}_\alpha\|^2 = \langle \bar{x}_\alpha, \bar{x}_\alpha \rangle$ and $V_{\alpha\beta} = \langle \bar{x}_\alpha, \bar{x}_\beta \rangle$. Then, using (7), we have
$$V_{\alpha\alpha} = \frac{1}{n_\alpha^2}\sum_{i=1}^{n_\alpha}\sum_{j=1}^{n_\alpha}K(y_{\alpha i}, y_{\alpha j}), \quad \text{and} \quad V_{\alpha\beta} = \frac{1}{n_\alpha n_\beta}\sum_{i=1}^{n_\alpha}\sum_{j=1}^{n_\beta}K(y_{\alpha i}, y_{\beta j}).$$
Therefore, using (6), we can compute $T_n$ via either of the following useful expressions:
$$T_n = \sum_{\alpha=1}^{k}\frac{n_\alpha(n - n_\alpha)}{n}V_{\alpha\alpha} - 2\sum_{1 \le \alpha < \beta \le k}\frac{n_\alpha n_\beta}{n}V_{\alpha\beta} = \sum_{1 \le \alpha < \beta \le k}\frac{n_\alpha n_\beta}{n}\left(V_{\alpha\alpha} + V_{\beta\beta} - 2V_{\alpha\beta}\right). \tag{8}$$
In other words, $T_n$ can be computed from the $k$ original samples (1) using the above expressions.
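To make expression (8) concrete, a minimal NumPy sketch is given below (our illustration, not the authors' code): each sample is assumed to be stored as an $n_\alpha \times p$ array, and `kernel(X, Y)` is assumed to return the Gram matrix $[K(x_i, y_j)]$ between the rows of `X` and `Y`.

```python
import numpy as np

def mmd_k_sample_stat(samples, kernel):
    """T_n via (8): `samples` is a list of (n_alpha, p) arrays; kernel(X, Y)
    returns the Gram matrix [K(x_i, y_j)] between the rows of X and Y."""
    k = len(samples)
    sizes = np.array([len(s) for s in samples], dtype=float)
    n = sizes.sum()
    # V[a, b] = <x_bar_a, x_bar_b>, computed through the kernel trick (7)
    V = np.empty((k, k))
    for a in range(k):
        for b in range(a, k):
            V[a, b] = V[b, a] = kernel(samples[a], samples[b]).mean()
    # T_n = sum_{a<b} (n_a n_b / n) (V_aa + V_bb - 2 V_ab)
    T = 0.0
    for a in range(k):
        for b in range(a + 1, k):
            T += sizes[a] * sizes[b] / n * (V[a, a] + V[b, b] - 2.0 * V[a, b])
    return T
```

Note that only Gram-matrix averages are needed, so the induced samples in $\mathcal{H}$ are never formed explicitly.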

2.3. Asymptotic Null Distribution

To explore the null distribution of $T_n$, we can rewrite $T_n$ as
$$T_n = \tilde{T}_n + 2S_n + \sum_{\alpha=1}^{k} n_\alpha\|\mu_\alpha - \bar{\mu}\|^2, \tag{9}$$
where $\bar{\mu} = n^{-1}\sum_{\alpha=1}^{k} n_\alpha\mu_\alpha$ represents the weighted average of the mean embeddings of the $k$ distributions. Additionally, we have
$$\tilde{T}_n = \sum_{\alpha=1}^{k} n_\alpha\|(\bar{x}_\alpha - \mu_\alpha) - (\bar{x} - \bar{\mu})\|^2, \quad \text{and} \quad S_n = \sum_{\alpha=1}^{k} n_\alpha\langle(\bar{x}_\alpha - \mu_\alpha) - (\bar{x} - \bar{\mu}), \mu_\alpha - \bar{\mu}\rangle. \tag{10}$$
In the expression for $\tilde{T}_n$, the mean embeddings of the $k$ distributions have been subtracted out. Consequently, $\tilde{T}_n$ follows the same distribution as $T_n$ under the null hypothesis. Thus, studying the null distribution of $T_n$ is equivalent to studying the distribution of $\tilde{T}_n$.
Similar to the proof of (6), we can express $\tilde{T}_n$ as follows:
$$\tilde{T}_n = \sum_{\alpha=1}^{k}\frac{n_\alpha(n - n_\alpha)}{n}\|\bar{x}_\alpha - \mu_\alpha\|^2 - 2\sum_{1 \le \alpha < \beta \le k}\frac{n_\alpha n_\beta}{n}\langle \bar{x}_\alpha - \mu_\alpha, \bar{x}_\beta - \mu_\beta \rangle = \sum_{1 \le \alpha < \beta \le k}\frac{n_\alpha n_\beta}{n}\|(\bar{x}_\alpha - \mu_\alpha) - (\bar{x}_\beta - \mu_\beta)\|^2. \tag{11}$$
Let $\tilde{K}(u, v)$ denote the centered version of $K(u, v)$, defined as
$$\tilde{K}(u, v) = \langle \phi(u) - \mu_u, \phi(v) - \mu_v \rangle = K(u, v) - \mathbb{E}_{v'}[K(u, v')] - \mathbb{E}_{u'}[K(u', v)] + \mathbb{E}_{u',v'}[K(u', v')], \tag{12}$$
where $\mu_u = \mathbb{E}[\phi(u)]$, $\mu_v = \mathbb{E}[\phi(v)]$, and $u'$ and $v'$ are independent copies of $u$ and $v$, respectively. We can observe two useful properties: when $u = v$, we have
$$\mathbb{E}_u[\tilde{K}(u, u)] = \mathbb{E}_u\|\phi(u) - \mu_u\|^2 > 0. \tag{13}$$
When $u$ and $v$ are independent, we have
$$\mathbb{E}_u[\tilde{K}(u, v)] = \mathbb{E}_v[\tilde{K}(u, v)] = \mathbb{E}_{u,v}[\tilde{K}(u, v)] = 0. \tag{14}$$
Using (12), we can express
$$\tilde{K}(y_{\alpha i}, y_{\beta j}) = \langle x_{\alpha i} - \mu_\alpha, x_{\beta j} - \mu_\beta \rangle. \tag{15}$$
Let
$$\tilde{V}_{\alpha\alpha} = \|\bar{x}_\alpha - \mu_\alpha\|^2, \quad \text{and} \quad \tilde{V}_{\alpha\beta} = \langle \bar{x}_\alpha - \mu_\alpha, \bar{x}_\beta - \mu_\beta \rangle. \tag{16}$$
It is evident that $\tilde{V}_{\alpha\alpha}$ and $\tilde{V}_{\alpha\beta}$ can be considered as centered versions of $V_{\alpha\alpha}$ and $V_{\alpha\beta}$, respectively. Consequently, by using (11) and (16), we can express
$$\tilde{T}_n = \sum_{\alpha=1}^{k}\frac{n_\alpha(n - n_\alpha)}{n}\tilde{V}_{\alpha\alpha} - 2\sum_{1 \le \alpha < \beta \le k}\frac{n_\alpha n_\beta}{n}\tilde{V}_{\alpha\beta}. \tag{17}$$
Utilizing (15) and performing some straightforward algebraic manipulations, we can express
$$\tilde{V}_{\alpha\alpha} = \frac{1}{n_\alpha^2}\sum_{i=1}^{n_\alpha}\sum_{j=1}^{n_\alpha}\tilde{K}(y_{\alpha i}, y_{\alpha j}) = \frac{1}{n_\alpha^2}\left[\sum_{i=1}^{n_\alpha}\tilde{K}(y_{\alpha i}, y_{\alpha i}) + 2\sum_{1 \le i < j \le n_\alpha}\tilde{K}(y_{\alpha i}, y_{\alpha j})\right], \quad \text{and} \quad \tilde{V}_{\alpha\beta} = \frac{1}{n_\alpha n_\beta}\sum_{i=1}^{n_\alpha}\sum_{j=1}^{n_\beta}\tilde{K}(y_{\alpha i}, y_{\beta j}), \quad \alpha \neq \beta. \tag{18}$$
Assuming that $\tilde{K}(y, y')$ is square-integrable, i.e., $\mathbb{E}_y[\tilde{K}^2(y, y)] < \infty$, we can express $\tilde{K}(y, y')$ using Mercer's expansion:
$$\tilde{K}(y, y') = \sum_{r=1}^{\infty}\lambda_r\psi_r(y)\psi_r(y'), \tag{19}$$
where $\lambda_1, \lambda_2, \ldots$ represent the eigenvalues of $\tilde{K}(y, y')$, and $\psi_1(y), \psi_2(y), \ldots$ are the corresponding orthonormal eigenelements, satisfying
$$\int \tilde{K}(y, y')\psi_r(y')\,dF(y') = \lambda_r\psi_r(y), \quad \text{and} \quad \int \psi_r(y)\psi_s(y)\,dF(y) = \delta_{rs}, \quad r, s = 1, 2, \ldots, \tag{20}$$
where $\delta_{rs}$ equals 1 when $r = s$ and 0 otherwise. Now, let us introduce the following conditions:
C1. We have $y_{\alpha 1}, \ldots, y_{\alpha n_\alpha},\ \alpha = 1, \ldots, k \overset{\text{i.i.d.}}{\sim} F$.
C2. As $n \to \infty$, we have $n_\alpha/n \to \tau_\alpha \in (0, 1)$, $\alpha = 1, \ldots, k$.
C3. $K(\cdot, \cdot)$ is a reproducing kernel such that $\mathbb{E}_y[\tilde{K}^2(y, y)] < \infty$.
Condition C1 assumes that the null hypothesis holds and that the common distribution function is $F$. Condition C2 is a regularity condition for $k$-sample problems; it requires that the group sample sizes tend to infinity proportionally. Condition C3 ensures that $\tilde{K}(y, y')$ is square-integrable so that the expansion (19) is valid.
In fact, under Condition C3, using (19) and the Cauchy–Schwarz inequality, we obtain the following results:
$$\mathbb{E}[\tilde{K}(y, y)] = \sum_{r=1}^{\infty}\lambda_r \le \sqrt{\mathbb{E}[\tilde{K}^2(y, y)]} < \infty, \quad \mathbb{E}_{y,y'}[\tilde{K}^2(y, y')] = \sum_{r=1}^{\infty}\lambda_r^2 \le \mathbb{E}_{y,y'}[\tilde{K}(y, y)\tilde{K}(y', y')] = \mathbb{E}_y^2[\tilde{K}(y, y)] < \infty, \tag{21}$$
where $y, y' \overset{\text{i.i.d.}}{\sim} F$. These inequalities hold due to the square-integrability assumption and the properties of Mercer's expansion. Now, let us state the following theorem that establishes the asymptotic distribution of $\tilde{T}_n$.
Theorem 1.
Under Conditions C1–C3, as $n \to \infty$, we have $\tilde{T}_n \overset{\mathcal{L}}{\longrightarrow} \tilde{T}$, where
$$\tilde{T} \overset{d}{=} \sum_{r=1}^{\infty}\lambda_r A_r, \quad A_r \overset{\text{i.i.d.}}{\sim} \chi^2_{k-1}. \tag{22}$$
It is worth highlighting that the limit null distribution of the proposed test statistic differs from the one derived in [18] (Theorem 1), and it offers a more straightforward alternative compared to the limit null distribution obtained by [17] (Theorem 1). This complexity explains why the limit null distribution presented by [17] is not employed to approximate the null distribution of their test statistic. However, as demonstrated in Section 3, it is indeed feasible to utilize the limit null distribution (22) if desired.

2.4. Mean and Variance of T ˜ n

Based on (13), (14) and (18), through some simple calculation, under Condition C1, we have
$$\begin{aligned}
&\mathbb{E}[\tilde{V}_{\alpha\alpha}] = \frac{1}{n_\alpha}\mathbb{E}[\tilde{K}(y, y)], \quad \mathbb{E}[\tilde{V}_{\alpha\beta}] = 0, \\
&\mathrm{Var}[\tilde{V}_{\alpha\alpha}] = \frac{1}{n_\alpha^3}\mathrm{Var}[\tilde{K}(y, y)] + \frac{2(n_\alpha - 1)}{n_\alpha^3}\mathbb{E}[\tilde{K}^2(y, y')], \\
&\mathrm{Var}[\tilde{V}_{\alpha\beta}] = \frac{1}{n_\alpha n_\beta}\mathbb{E}[\tilde{K}^2(y, y')], \quad \text{and} \quad \mathrm{Cov}(\tilde{V}_{\alpha\alpha}, \tilde{V}_{\alpha\beta}) = 0, \quad \alpha \neq \beta,
\end{aligned} \tag{23}$$
where $y, y' \overset{\text{i.i.d.}}{\sim} F$.
Theorem 2.
Under Condition C1, we have $\mathbb{E}(\tilde{T}_n) = (k - 1)\mathbb{E}[\tilde{K}(y, y)]$, and
$$\mathrm{Var}(\tilde{T}_n) = \sum_{\alpha=1}^{k}\frac{(n - n_\alpha)^2}{n^2 n_\alpha}\mathrm{Var}[\tilde{K}(y, y)] + 2\left[(k - 1) - \sum_{\alpha=1}^{k}\frac{(n - n_\alpha)^2}{n^2 n_\alpha}\right]\mathbb{E}[\tilde{K}^2(y, y')],$$
where $y, y' \overset{\text{i.i.d.}}{\sim} F$.
Note that under Condition C2, we have $\sum_{\alpha=1}^{k}(n - n_\alpha)^2/(n^2 n_\alpha) \le \sum_{\alpha=1}^{k} n_\alpha^{-1} \to 0$ as $n \to \infty$. Then, as $n \to \infty$, we have $\mathrm{Var}(\tilde{T}_n) = 2(k - 1)\mathbb{E}[\tilde{K}^2(y, y')][1 + o(1)]$.

2.5. Asymptotic Power

In this subsection, we examine the asymptotic power of the proposed test under the following local alternative hypothesis:
$$H_{1n}: \mu_\alpha = \mu + n^{-(1/2 - \Delta)}h_\alpha, \quad \alpha = 1, \ldots, k, \tag{24}$$
where $0 < \Delta \le 1/2$ and $h_\alpha, \alpha = 1, \ldots, k$, are constant elements in the RKHS $\mathcal{H}$ such that
$$0 < \sum_{\alpha=1}^{k}\frac{n_\alpha}{n}\|h_\alpha - \bar{h}\|^2 \le B_h < \infty, \quad \text{with} \quad \bar{h} = \frac{1}{n}\sum_{\alpha=1}^{k} n_\alpha h_\alpha,$$
and
$$\sum_{\alpha=1}^{k}\sigma_\alpha^2 > 0, \quad \text{where} \quad \sigma_\alpha^2 = \mathbb{E}\langle x_{\alpha 1} - \mu_\alpha, h_\alpha - \bar{h}\rangle^2, \quad \alpha = 1, \ldots, k.$$
It is important to note that when $\Delta = 1/2$, the local alternative (24) reduces to a fixed alternative hypothesis $H_1: \mu_\alpha = \mu + h_\alpha$ for $\alpha = 1, \ldots, k$. However, when $0 < \Delta < 1/2$, Equation (24) represents a strict local alternative hypothesis, which converges to the null hypothesis as $n \to \infty$ and thus becomes exceedingly challenging to detect. A test is typically considered root-$n$-consistent if it can detect such a strict local alternative hypothesis with probability approaching 1 as $n \to \infty$. A root-$n$-consistent test is considered effective because it achieves the best possible detection rate for a local alternative hypothesis as $n$ grows.
Through (9) and (24), $T_n$ can be written as
$$T_n = \tilde{T}_n + 2S_n + n^{2\Delta}\sum_{\alpha=1}^{k}\frac{n_\alpha}{n}\|h_\alpha - \bar{h}\|^2, \tag{25}$$
where
$$S_n = n^{-(1/2 - \Delta)}\sum_{\alpha=1}^{k} n_\alpha\langle \bar{x}_\alpha - \mu_\alpha, h_\alpha - \bar{h}\rangle.$$
Theorem 3.
Assume that $|K(y, y')| \le B_K$ for all $y, y' \in \mathbb{R}^p$ for some $B_K < \infty$. Then, we have $\mathrm{Var}(\tilde{T}_n) \le 32(k - 1)B_K^2$ and $\mathrm{Var}(S_n) = n^{2\Delta}\sum_{\alpha=1}^{k}(n_\alpha/n)\sigma_\alpha^2$, where $\sum_{\alpha=1}^{k}(n_\alpha/n)\sigma_\alpha^2 \le 4B_K B_h < \infty$.
Theorem 4.
Assume that $|K(y, y')| \le B_K$ for all $y, y' \in \mathbb{R}^p$ for some $B_K < \infty$. Then, under Condition C2 and the local alternative hypothesis (24), as $n \to \infty$, we have (a) $\tilde{T}_n/\sqrt{\mathrm{Var}(S_n)} \overset{P}{\longrightarrow} 0$; (b) $S_n/\sqrt{\mathrm{Var}(S_n)} \overset{\mathcal{L}}{\longrightarrow} N(0, 1)$; and (c)
$$\Pr(T_n \ge C_\epsilon) = \Phi\left(\frac{n^{\Delta}\sum_{\alpha=1}^{k}\tau_\alpha\|h_\alpha - h^*\|^2}{2\sqrt{\sum_{\alpha=1}^{k}\tau_\alpha\sigma_\alpha^{*2}}}[1 + o(1)]\right) \to 1,$$
where $C_\epsilon$ denotes the upper $100\epsilon$ percentile of $\tilde{T}_n$ with $\epsilon$ being the given significance level; $\tau_\alpha, \alpha = 1, \ldots, k$, are defined in Condition C2; $h^* = \sum_{\alpha=1}^{k}\tau_\alpha h_\alpha$; $\sigma_\alpha^{*2} = \mathbb{E}\langle x_{\alpha 1} - \mu_\alpha, h_\alpha - h^*\rangle^2$; and $\Phi(\cdot)$ denotes the cumulative distribution function of $N(0, 1)$.
Theorem 4 shows that the proposed test is indeed a root-n-consistent test.

3. Methods for Implementing the Proposed Test

In this section, we will outline three different approaches for approximating the null distribution of T n (8) in order to conduct the proposed test. These methods include a parametric bootstrap approach, a random permutation technique, and a χ 2 -approximation method. We will evaluate and compare their performance in the next section.

3.1. Parametric Bootstrap Method

Theorem 1 reveals that the asymptotic null distribution of T n takes the form of a chi-squared-type mixture denoted as T ˜ (22). The coefficients of this mixture are determined by the unknown eigenvalues λ r , r = 1 , 2 , of K ˜ ( y , y ) , where y , y are independently and identically distributed according to the common distribution function F representing the k distributions when the null hypothesis is valid. Consequently, in order to estimate the asymptotic null distribution of T n , it is essential to consistently estimate λ r , r = 1 , 2 , . This consistency can be achieved by utilizing the empirical eigenvalues of the centered Gram matrix, as suggested by [20], to construct a reliable estimator for T ˜ (22).
Let us recall that $n = \sum_{\alpha=1}^{k} n_\alpha$ represents the total sample size. We pool the $k$ samples (1) and denote the pooled sample as
$$y_1, \ldots, y_n. \tag{26}$$
Under the null hypothesis, $y_1, \ldots, y_n$ are independently and identically distributed from the common distribution $F$. Let $\boldsymbol{K}$ represent the $n \times n$ Gram matrix, whose $(i, j)$th entry is $K(y_i, y_j)$ for $i, j = 1, \ldots, n$. Additionally, let $\boldsymbol{1}_n$ denote the $n \times 1$ vector of ones, and $\boldsymbol{I}_n$ the identity matrix of size $n \times n$. Then, the matrix $\boldsymbol{P}_n = \boldsymbol{I}_n - \boldsymbol{1}_n\boldsymbol{1}_n'/n$ is a projection matrix of size $n \times n$ with rank $n - 1$.
Now, define $\tilde{\boldsymbol{K}}_n = \boldsymbol{P}_n\boldsymbol{K}\boldsymbol{P}_n$, commonly referred to as the centered Gram matrix. Its $(i, j)$th entry is given by
$$\tilde{K}_n(y_i, y_j) = K(y_i, y_j) - n^{-1}\sum_{v=1}^{n}K(y_i, y_v) - n^{-1}\sum_{u=1}^{n}K(y_u, y_j) + n^{-2}\sum_{u=1}^{n}\sum_{v=1}^{n}K(y_u, y_v), \tag{27}$$
where $i, j = 1, \ldots, n$. As $n$ approaches infinity, for any fixed $i$ and $j$, we can observe that, by the law of large numbers,
$$\tilde{K}_n(y_i, y_j) \overset{P}{\longrightarrow} \tilde{K}(y_i, y_j) = K(y_i, y_j) - \mathbb{E}_{y'}[K(y_i, y')] - \mathbb{E}_{y}[K(y, y_j)] + \mathbb{E}_{y,y'}[K(y, y')].$$
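In matrix form, (27) is simply the double-centering $\tilde{\boldsymbol{K}}_n = \boldsymbol{P}_n\boldsymbol{K}\boldsymbol{P}_n$, which can be computed without forming $\boldsymbol{P}_n$ explicitly. A minimal sketch (our helper, reused in the implementations below):

```python
import numpy as np

def centered_gram(K):
    """Entrywise double-centering (27): K_tilde_n = P_n K P_n."""
    row = K.mean(axis=1, keepdims=True)   # n^{-1} sum_v K(y_i, y_v)
    col = K.mean(axis=0, keepdims=True)   # n^{-1} sum_u K(y_u, y_j)
    return K - row - col + K.mean()
```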
Let $\varpi_1, \ldots, \varpi_q$ be all the non-zero eigenvalues of $\tilde{\boldsymbol{K}}_n$, obtained via an eigen-decomposition of $\tilde{\boldsymbol{K}}_n$, and set $\hat{\lambda}_r = \varpi_r/n$, $r = 1, \ldots, q$. Then, following [20], we can show that under Condition C3, the distribution of $\tilde{T}$ can be consistently estimated by that of
$$\hat{\tilde{T}} = \sum_{r=1}^{q}\hat{\lambda}_r A_r, \quad A_r \overset{\text{i.i.d.}}{\sim} \chi^2_{k-1}. \tag{28}$$
The parametric bootstrap method can be described as follows. Choose a large value for $N$, for example, $N = 1000$. Using expression (28), we can obtain a sample $\hat{\tilde{T}}^{(1)}, \ldots, \hat{\tilde{T}}^{(N)}$ of $\hat{\tilde{T}}$ by independently generating $A_r, r = 1, \ldots, q$, a total of $N$ times. Now, let $T_{\mathrm{obs}}$ represent the observed test statistic calculated using (8) based on the $k$ samples (1). Using the parametric bootstrap method, we can conduct the proposed test by calculating the approximate p-value $N^{-1}\sum_{i=1}^{N} I\{\hat{\tilde{T}}^{(i)} \ge T_{\mathrm{obs}}\}$, where $I\{S\}$ is an indicator function that takes the value 1 when $S$ is a true event and 0 otherwise.
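A minimal sketch of the whole procedure follows (our code, reusing `mmd_k_sample_stat` and `centered_gram` from the earlier sketches; the tolerance for discarding numerically zero eigenvalues is our choice):

```python
import numpy as np

def parametric_bootstrap_pvalue(samples, kernel, N=1000, seed=None):
    """Approximate the p-value of T_n by simulating the estimated null (28)."""
    rng = np.random.default_rng(seed)
    k = len(samples)
    pooled = np.vstack(samples)
    n = len(pooled)
    K_tilde = centered_gram(kernel(pooled, pooled))
    eigvals = np.linalg.eigvalsh(K_tilde)
    lam_hat = eigvals[eigvals > 1e-10] / n        # lambda_hat_r = varpi_r / n
    # N independent draws of sum_r lambda_hat_r A_r, with A_r ~ chi^2_{k-1}
    A = rng.chisquare(df=k - 1, size=(N, lam_hat.size))
    T_null = A @ lam_hat
    T_obs = mmd_k_sample_stat(samples, kernel)
    return np.mean(T_null >= T_obs)
```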

3.2. Random Permutation Method

We can also approximate the null distribution of $T_n$ using a random permutation method. Let $(1^*, \ldots, n^*)$ represent a random permutation of the indices $(1, \ldots, n)$ of the pooled sample (26). Consequently, the sequence
$$y_{1^*}, \ldots, y_{n^*}, \tag{29}$$
forms a permutation of the pooled sample (26). To create permuted samples, we utilize the first $n_1$ observations in the permuted pooled sample (29) as the first permuted sample, the next $n_2$ observations as the second permuted sample, and so on, until we obtain $k$ permuted samples. These permuted samples are denoted as
$$y_{\alpha i}^*, \quad i = 1, \ldots, n_\alpha; \quad \alpha = 1, \ldots, k. \tag{30}$$
The permuted test statistic, denoted as $T_n^*$, is calculated using (8) but with the $k$ samples (1) replaced by the $k$ permuted samples (30).
The random permutation method proceeds as follows. Let $N$ be a sufficiently large number, for instance, $N = 1000$. Suppose we repeat the permutation process described above $N$ times, resulting in $N$ permuted test statistics $T_n^{*(i)}, i = 1, \ldots, N$. Then, we can use the empirical distribution of $T_n^{*(i)}, i = 1, \ldots, N$, to approximate the null distribution of $T_n$. Recall that $T_{\mathrm{obs}}$ represents the test statistic computed using (8) based on the $k$ original samples (1). Following the random permutation method, the proposed test can be conducted by calculating the approximated p-value $N^{-1}\sum_{i=1}^{N} I\{T_n^{*(i)} \ge T_{\mathrm{obs}}\}$.
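A minimal sketch of the permutation procedure (our code, reusing `mmd_k_sample_stat` from the sketch in Section 2.2):

```python
import numpy as np

def permutation_pvalue(samples, kernel, N=1000, seed=None):
    """Approximate the p-value of T_n by randomly permuting the pooled sample."""
    rng = np.random.default_rng(seed)
    sizes = [len(s) for s in samples]
    edges = np.cumsum([0] + sizes)
    pooled = np.vstack(samples)
    T_obs = mmd_k_sample_stat(samples, kernel)
    exceed = 0
    for _ in range(N):
        idx = rng.permutation(len(pooled))        # the permutation (1*, ..., n*)
        perm = [pooled[idx[edges[a]:edges[a + 1]]] for a in range(len(sizes))]
        exceed += mmd_k_sample_stat(perm, kernel) >= T_obs
    return exceed / N
```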

3.3. Welch–Satterthwaite χ 2 -Approximation Method

The parametric bootstrap method and the random permutation method are effective for controlling size but can be computationally intensive, particularly with large total sample sizes. To address this issue, we can utilize the well-known Welch–Satterthwaite (W–S) χ 2 -approximation method [21,22]. This method is known to be reliable for approximating the distribution of a chi-squared-type mixture. Theorem 1 demonstrates that the asymptotic null distribution of T n is a chi-squared-type mixture T ˜ (22).
The core concept of the W–S $\chi^2$-approximation method is to approximate the null distribution of $T_n$ using that of a random variable of the form $W \overset{d}{=} \beta\chi_d^2$, where $\beta$ and $d$ are unknown parameters. These parameters can be determined by matching the means and variances of $\tilde{T}_n$ and $W$, where $\tilde{T}_n$, defined in (10), has the same distribution as $T_n$ under the null hypothesis. Specifically, the mean and variance of $W$ are $\beta d$ and $2\beta^2 d$, respectively, while the mean and variance of $\tilde{T}_n$ are given in Theorem 2. Equating the means and variances of $\tilde{T}_n$ and $W$, we obtain
$$\beta = \frac{\mathrm{Var}(\tilde{T}_n)}{2\mathbb{E}(\tilde{T}_n)} \quad \text{and} \quad d = \frac{2\mathbb{E}^2(\tilde{T}_n)}{\mathrm{Var}(\tilde{T}_n)}. \tag{31}$$
To implement the W–S $\chi^2$-approximation method, we need to consistently estimate $\mathbb{E}(\tilde{T}_n)$ and $\mathrm{Var}(\tilde{T}_n)$ based on the pooled sample (26) from the $k$ samples (1). According to Theorem 2, these estimates can be obtained as follows:
$$\hat{\mathbb{E}}(\tilde{T}_n) = (k - 1)\hat{\mathbb{E}}[\tilde{K}(y, y)], \quad \text{and} \quad \widehat{\mathrm{Var}}(\tilde{T}_n) = \sum_{\alpha=1}^{k}\frac{(n - n_\alpha)^2}{n^2 n_\alpha}\widehat{\mathrm{Var}}[\tilde{K}(y, y)] + 2\left[(k - 1) - \sum_{\alpha=1}^{k}\frac{(n - n_\alpha)^2}{n^2 n_\alpha}\right]\hat{\mathbb{E}}[\tilde{K}^2(y, y')],$$
where
$$\hat{\mathbb{E}}[\tilde{K}(y, y)] = n^{-1}\sum_{i=1}^{n}\tilde{K}_n(y_i, y_i), \quad \widehat{\mathrm{Var}}[\tilde{K}(y, y)] = (n - 1)^{-1}\sum_{i=1}^{n}\left[\tilde{K}_n(y_i, y_i) - n^{-1}\sum_{j=1}^{n}\tilde{K}_n(y_j, y_j)\right]^2, \quad \text{and} \quad \hat{\mathbb{E}}[\tilde{K}^2(y, y')] = \frac{2}{n(n - 1)}\sum_{1 \le i < j \le n}\tilde{K}_n^2(y_i, y_j),$$
with $\tilde{K}_n(y_i, y_j)$ defined in (27). Substituting these estimators into (31), we obtain
$$\hat{\beta} = \frac{\widehat{\mathrm{Var}}(\tilde{T}_n)}{2\hat{\mathbb{E}}(\tilde{T}_n)} \quad \text{and} \quad \hat{d} = \frac{2\hat{\mathbb{E}}^2(\tilde{T}_n)}{\widehat{\mathrm{Var}}(\tilde{T}_n)}.$$
Let $\chi^2_{\hat{d}}(\alpha^*)$ denote the upper $100\alpha^*$ percentile of $\chi^2_{\hat{d}}$, where $\alpha^*$ is the given significance level, and let $T_{\mathrm{obs}}$ denote the observed test statistic computed using (8) based on the $k$ samples (1). Then, through the W–S $\chi^2$-approximation method, the proposed test can be conducted by rejecting the null hypothesis when $T_{\mathrm{obs}} > \hat{\beta}\chi^2_{\hat{d}}(\alpha^*)$, or, equivalently, when the approximated p-value $\Pr\left(\chi^2_{\hat{d}} > T_{\mathrm{obs}}/\hat{\beta}\right)$ is less than $\alpha^*$.
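A minimal sketch of the W–S method follows (our code, reusing the helpers above; the plug-in quantities follow Theorem 2 with $c = \sum_{\alpha=1}^{k}(n - n_\alpha)^2/(n^2 n_\alpha)$):

```python
import numpy as np
from scipy import stats

def ws_chi2_pvalue(samples, kernel):
    """W-S approximation: match E and Var of T_n under H0 with beta * chi^2_d."""
    k = len(samples)
    sizes = np.array([len(s) for s in samples], dtype=float)
    pooled = np.vstack(samples)
    n = len(pooled)
    Kt = centered_gram(kernel(pooled, pooled))
    diag = np.diag(Kt)
    E_K = diag.mean()                           # estimate of E[K~(y, y)]
    V_K = diag.var(ddof=1)                      # estimate of Var[K~(y, y)]
    off = Kt[np.triu_indices(n, 1)]
    E_K2 = np.mean(off ** 2)                    # estimate of E[K~^2(y, y')]
    c = np.sum((n - sizes) ** 2 / (n ** 2 * sizes))
    mean_T = (k - 1) * E_K
    var_T = c * V_K + 2.0 * ((k - 1) - c) * E_K2
    beta_hat = var_T / (2.0 * mean_T)           # (31) with plug-in estimates
    d_hat = 2.0 * mean_T ** 2 / var_T
    T_obs = mmd_k_sample_stat(samples, kernel)
    return stats.chi2.sf(T_obs / beta_hat, df=d_hat)
```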

4. Simulation Studies

In this section, we delve into intensive simulation studies aimed at assessing the performance of the test we propose when compared to the energy test introduced by [12], which we denote as T S R . Our proposed test employs the parametric bootstrap, the random permutation, and the W–S χ 2 -approximation methods as described in Section 3. For simplicity, we refer to the resulting tests as T P B , T R P , and T W S , respectively.
For simplicity, we opt for the Gaussian Radial Basis Function (RBF) kernel, denoted as K ( · , · ) , which is defined as follows:
$$K(y, y') = \exp\left(-\frac{\|y - y'\|^2}{2\sigma^2}\right).$$
Here, σ 2 is referred to as the kernel width. It is worth noting that the Gaussian RBF kernel described above is bounded by 1, ensuring that Condition C3 is always met. Following the approach outlined in [20], we set σ to be equal to the median distance between observed vectors in the pooled sample (26).
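For reference, a minimal sketch of this kernel and the median-distance rule (our code, using standard SciPy distance helpers):

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist

def rbf_kernel(X, Y, sigma):
    """Gaussian RBF Gram matrix: K(y, y') = exp(-||y - y'||^2 / (2 sigma^2))."""
    return np.exp(-cdist(X, Y, "sqeuclidean") / (2.0 * sigma ** 2))

def median_bandwidth(pooled):
    """Set sigma to the median pairwise distance in the pooled sample (26)."""
    return np.median(pdist(pooled))
```

With these pieces, one can run, e.g., `ws_chi2_pvalue(samples, lambda X, Y: rbf_kernel(X, Y, median_bandwidth(np.vstack(samples))))`.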
We also employ the Average Relative Error (ARE) introduced by [23] to evaluate the overall effectiveness of a test in maintaining its nominal size. The ARE is calculated as $\mathrm{ARE} = 100M^{-1}\sum_{i=1}^{M}|\hat{\alpha}_i^* - \alpha^*|/\alpha^*$, where $\hat{\alpha}_i^*, i = 1, \ldots, M$, represent the empirical sizes observed across $M$ different simulation settings. A smaller ARE value indicates better performance of a test in terms of size control. In this simulation study, we set the nominal size to $\alpha^* = 5\%$.
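As a one-line helper (ours), the ARE can be computed as follows:

```python
import numpy as np

def average_relative_error(empirical_sizes, alpha_star=0.05):
    """ARE = 100 M^{-1} sum_i |alpha_hat_i - alpha*| / alpha*."""
    sizes = np.asarray(empirical_sizes, dtype=float)
    return 100.0 * np.mean(np.abs(sizes - alpha_star)) / alpha_star
```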

4.1. Simulation 1

We set $k = 3$ for simplicity. We generate the $k = 3$ samples (1) as follows. We set
$$y_{1i} = \mu + \Gamma(u_{1i} + \delta_1 v_{1i}), \quad i = 1, \ldots, n_1, \qquad y_{2i} = \mu + \Gamma(u_{2i} + \delta_2 v_{2i}), \quad i = 1, \ldots, n_2, \qquad y_{3i} = \mu + \Gamma u_{3i}, \quad i = 1, \ldots, n_3,$$
where $u_{\alpha i}, i = 1, \ldots, n_\alpha; \alpha = 1, 2, 3 \overset{\text{i.i.d.}}{\sim} N_p(\boldsymbol{0}, \boldsymbol{I}_p)$, while $v_{\alpha i} = (v_{\alpha,i1}, \ldots, v_{\alpha,ip})', i = 1, \ldots, n_\alpha; \alpha = 1, 2$, are generated using the following three models:
Model 1. $v_{\alpha,ir},\ i = 1, \ldots, n_\alpha;\ \alpha = 1, 2;\ r = 1, \ldots, p \overset{\text{i.i.d.}}{\sim} N(0, 1)$.
Model 2. $v_{\alpha,ir} = u_{\alpha,ir}/\sqrt{2},\ i = 1, \ldots, n_\alpha;\ \alpha = 1, 2;\ r = 1, \ldots, p$, with $u_{\alpha,ir},\ i = 1, \ldots, n_\alpha;\ \alpha = 1, 2;\ r = 1, \ldots, p \overset{\text{i.i.d.}}{\sim} t_4$.
Model 3. $v_{\alpha,ir} = (u_{\alpha,ir} - 1)/\sqrt{2},\ i = 1, \ldots, n_\alpha;\ \alpha = 1, 2;\ r = 1, \ldots, p$, with $u_{\alpha,ir},\ i = 1, \ldots, n_\alpha;\ \alpha = 1, 2;\ r = 1, \ldots, p \overset{\text{i.i.d.}}{\sim} \chi_1^2$.
It is important to note that δ 1 and δ 2 play pivotal roles as tuning parameters that govern the similarity of the distributions among the three generated samples. Specifically, when both δ 1 and δ 2 are set to zero ( δ 1 = δ 2 = 0 ), the three samples generated from the three models exhibit identical distributions. If at least one of δ i , where i can be 1 or 2, is non-zero, the samples still share the same mean vector but differ in their covariance matrices. Furthermore, it is worth mentioning that the test’s power increases as both δ 1 and δ 2 increase.
We set $\mu = h(1, \ldots, p)'/\sqrt{\sum_{i=1}^{p} i^2}$ and $\Gamma = a[(1 - \rho)\boldsymbol{I}_p + \rho\boldsymbol{J}_p]^{1/2}$, where $\boldsymbol{J}_p$ denotes the $p \times p$ matrix of ones. To assess the performance of the considered tests across a range of dimensionality settings, we examine three cases: $p = 10$, $p = 100$, and $p = 500$. For each of these cases, we consider three sets of sample sizes $(n_1, n_2, n_3)$: $(20, 30, 40)$, $(80, 120, 160)$, and $(160, 240, 320)$. Additionally, we investigate three levels of correlation: $\rho = 0.1$, $\rho = 0.5$, and $\rho = 0.9$. These three values of $\rho$ correspond to samples that are nearly uncorrelated, moderately correlated, and highly correlated, respectively; the correlation increases as $\rho$ grows. For simplicity, we set $h = 2$ and $a = 1.5$ across all three models.
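For illustration, a sketch of the Model 1 data-generating process is given below (our code; it reads $\Gamma$ as the symmetric square root $a[(1 - \rho)\boldsymbol{I}_p + \rho\boldsymbol{J}_p]^{1/2}$, as specified above, and the function name is ours):

```python
import numpy as np

def gen_simulation1_model1(p, sizes, delta1, delta2, rho, h=2.0, a=1.5, seed=None):
    """Generate the three samples y_{alpha i} = mu + Gamma(u + delta v) of
    Simulation 1 under Model 1 (v ~ N(0, 1))."""
    rng = np.random.default_rng(seed)
    grid = np.arange(1, p + 1, dtype=float)
    mu = h * grid / np.sqrt(np.sum(grid ** 2))
    # Gamma = a [(1 - rho) I_p + rho J_p]^{1/2} via an eigen-decomposition
    C = (1.0 - rho) * np.eye(p) + rho * np.ones((p, p))
    w, Q = np.linalg.eigh(C)
    Gamma = a * (Q * np.sqrt(np.clip(w, 0.0, None))) @ Q.T
    samples, deltas = [], [delta1, delta2, 0.0]
    for n_a, d in zip(sizes, deltas):
        u = rng.standard_normal((n_a, p))
        v = rng.standard_normal((n_a, p))          # Model 1: v_{alpha,ir} ~ N(0, 1)
        samples.append(mu + (u + d * v) @ Gamma)   # Gamma is symmetric
    return samples
```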
In the case of the parametric bootstrap and random permutation methods, as well as the energy test, we use a total of N = 1000 replicates for computing the p-values at each simulation run, as described in Section 3.1 and Section 3.2. It is worth noting that the W–S χ 2 -approximation method, which does not require generating replicates, is the least time-consuming among the methods considered. The empirical sizes and powers are computed based on 1000 simulation runs.
Table 1 provides an overview of the empirical sizes of T P B , T R P , T W S , and T S R , with the last row displaying the associated ARE values for the three different ρ values. Several observations can be made based on this table: Firstly, for nearly uncorrelated samples ( ρ = 0.1 ), T W S exhibits a slight tendency to be liberal, with an ARE value of 16.22 that is marginally higher than the ARE values of the other three tests. Secondly, when the generated samples are moderately correlated ( ρ = 0.5 ) or highly correlated ( ρ = 0.9 ), all four tests demonstrate fairly similar empirical sizes and ARE values, making them comparable in terms of size control. Finally, it is seen that the influence of sample sizes on size control is relatively minor, even though, in theory, a larger total sample size should result in better size control.
Figure 1 displays the empirical powers of all four tests in scenarios where all three generated samples have the same mean vectors, but they differ from each other in covariance matrices. Several conclusions can be drawn regarding these power values: Firstly, for ρ = 0.1 , 0.5 , and 0.9 , T P B , T R P , and T W S exhibit similar empirical powers. This suggests that these three tests perform comparably, regardless of whether the data are nearly uncorrelated, moderately correlated, or highly correlated. Secondly, it is seen that under similar settings, as expected, the empirical powers of the tests generally increase with larger sample sizes. Finally, the empirical powers of T S R consistently rank the lowest among all four tests. This indicates that T S R is less powerful compared to the other three tests in these scenarios.

4.2. Simulation 2

It should be acknowledged that the MMD-based tests $T_{PB}$, $T_{RP}$, and $T_{WS}$ may not consistently demonstrate superior performance relative to the energy-distance-based test $T_{SR}$, as they did in Simulation 1. To illustrate this point, in this simulation study we keep the same experimental framework as described in Simulation 1 but introduce a new collection of three models, defined as follows:
Model 4. $v_{\alpha,ir},\ i = 1, \ldots, n_\alpha;\ \alpha = 1, 2;\ r = 1, \ldots, p \overset{\text{i.i.d.}}{\sim} N(0.5, 1)$.
Model 5. $v_{\alpha,ir} = u_{\alpha,ir}/\sqrt{2} + 0.5,\ i = 1, \ldots, n_\alpha;\ \alpha = 1, 2;\ r = 1, \ldots, p$, with $u_{\alpha,ir},\ i = 1, \ldots, n_\alpha;\ \alpha = 1, 2;\ r = 1, \ldots, p \overset{\text{i.i.d.}}{\sim} t_4$.
Model 6. $v_{\alpha,ir} = (u_{\alpha,ir} - 1)/\sqrt{2} + 0.5,\ i = 1, \ldots, n_\alpha;\ \alpha = 1, 2;\ r = 1, \ldots, p$, with $u_{\alpha,ir},\ i = 1, \ldots, n_\alpha;\ \alpha = 1, 2;\ r = 1, \ldots, p \overset{\text{i.i.d.}}{\sim} \chi_1^2$.
Note that $v_{\alpha,ir}$, where $i$ ranges from 1 to $n_\alpha$, $\alpha$ takes values 1 and 2, and $r$ ranges from 1 to $p$, is adjusted to ensure that $\mathbb{E}(v_{\alpha,ir}) = 0.5$ and $\mathrm{Var}(v_{\alpha,ir}) = 1$ across all three models. When both $\delta_1$ and $\delta_2$ are set to 0, the three generated samples follow the same distributions as in Simulation 1. Consequently, the empirical sizes of all four tests in Simulation 2 will be similar to those observed in Simulation 1. However, when at least one of $\delta_i$ (where $i$ takes values 1 and 2) is non-zero, the three samples exhibit distinct mean vectors and covariance matrices, differing from those observed in Simulation 1. As a result, we focus on the empirical powers of all four tests based on Models 4–6.
Figure 2 presents the empirical powers of all four tests based on Models 4–6, offering several noteworthy insights. First, it is evident that T P B , T R P , and T W S demonstrate similar empirical powers. This implies that these three tests exhibit comparable performance regardless of whether the generated data follow a normal or non-normal distribution. Second, it is also observed that under similar settings, the empirical powers of the tests generally increase with larger sample sizes. Lastly, when ρ takes on values of 0.5 and 0.9, the empirical powers of T S R surpass those of the other three tests. However, when ρ = 0.1 , the empirical powers of all four tests are generally comparable. This indicates that, in the scenarios under consideration, T S R demonstrates greater effectiveness compared to the other three tests when the correlation coefficient ρ is relatively high.
From these two simulation studies, it can be inferred that the proposed MMD-based tests T P B , T R P , and T W S may outperform the energy-distance-based test T S R when the differences in distributions are primarily in covariance matrices, while the reverse could be true when the differences in distribution involve both mean vectors and covariance matrices. Notably, the MMD-based test T W S generally requires less computational effort compared to the bootstrap or permutation-based tests T P B , T R P , and T S R .

5. Application to the Corneal Surface Data

The corneal surface data are briefly mentioned in Section 1. They were acquired during a keratoconus study, a collaborative project involving Ms. Nancy Tripoli and Dr. Kenneth L. Cohen from the Department of Ophthalmology at the University of North Carolina, Chapel Hill. This dataset comprises 150 observations, with each corneal surface having more than 6000 height measurements. It can be categorized into four distinct groups: a group of 43 healthy corneas (referred to as the normal cornea group), a group of 14 corneas with unilateral suspect characteristics, a group of 21 corneas with suspect map features, and a group of 72 corneas clinically diagnosed with keratoconus. It is important to note that the corneal surfaces within the normal, unilateral suspect, and suspect map groups exhibit similar shapes, but they differ significantly from the corneal surfaces observed in the clinical keratoconus group (refer to Figure 1 in [24] for visualization). In the process of reconstructing a corneal surface, ref. [24] utilized the Zernike regression model to fit the height measurements associated with the corneal surface. The height of the corneal surface at a specific radius $r$ and angle $\theta$ is denoted as $h(r, \theta)$, while $\hat{h}(r, \theta)$ represents the height estimated through the fitted model within the predefined region of interest. This region of interest spans from $r = 0$ to $r = r_0$ and from $\theta = 0$ to $\theta = 2\pi$, with $r_0$ being a predetermined positive constant. To naturally represent each corneal surface, a feature vector is constructed, consisting of the values $\hat{h}(r_i, \theta_j)$, obtained by evaluating the fitted corneal surface $\hat{h}(r, \theta)$ at the grid of points $r_i = r_0 i/K$ and $\theta_j = 2\pi j/L$ for $i = 1, \ldots, K$ and $j = 1, \ldots, L$. For simplicity, we choose to set $K = 20$ and $L = 100$, resulting in a feature vector of dimension 2000 for each corneal surface.
For simplicity, we put the fitted feature vectors for the complete corneal surface dataset collectively into a feature matrix with dimensions of 150 × 2000 . In this matrix, each row corresponds to a feature vector representing a corneal surface. Specifically, the initial 43 rows of the feature matrix correspond to observations from the normal group, sequentially followed by 14 rows from the unilateral suspect group, 21 rows from the suspect map group, and lastly, 72 rows from the clinical keratoconus group.
Our objective is to examine whether there are significant differences in the distributions among various corneal surface groups, referred to as multi-sample problems for the equality of distributions for high-dimensional data, given that the high-dimensional feature vectors represent the observations of the corneal surface data. In this application, we employ T P B , T R P , T W S , and T S R to address these problems. For both the parametric bootstrap and random permutation methods, as well as the energy test, we perform a total of N = 10,000 replicates to compute the associated p-values. For simplicity, we denote the normal, unilateral suspect, suspect map, and clinical keratoconus groups as NOR, UNI, SUS, and CLI, respectively.
Table 2 displays the results obtained from the application of four statistical tests, namely T P B , T R P , T W S , and T S R , to assess the equality of distributions among different corneal surface groups. A careful examination of these results yields several noteworthy insights. To begin with, when considering the comparison among the corneal surface groups labeled “NOR vs. UNI vs. SUS vs. CLI”, it is evident that all four tests reject the null hypothesis. This rejection signifies that there exists at least one significant difference among the distributions of these four corneal surface groups. Consequently, further investigation is warranted to identify which specific group or groups differ from the others. Secondly, it is important to highlight the outcome for the comparison involving “NOR vs. UNI vs. SUS”. In this case, none of the four tests reject the null hypothesis at the 5 % significance level. This outcome indicates that the normal, unilateral suspect, and suspect map groups share a similar distribution pattern. Thirdly, across the remaining three comparisons, all four tests consistently reject the null hypothesis. These results align with the observations depicted in Figure 1 of [24], which illustrates the distinctiveness of corneal surfaces within the clinical keratoconus group when compared to the other three groups. Fourthly, when focusing on the comparison “NOR vs. UNI vs. SUS”, it is worth noting that the p-values obtained from T P B , T R P , and T W S are quite similar, suggesting their comparable performance. This consistency in p-values is also reflected in the empirical sizes presented in Table 1. Lastly, when analyzing the cases involving CLI, it becomes evident that the p-values generated by T S R consistently exhibit larger values compared to those produced by the other three tests. This discrepancy implies that T S R may have a lower sensitivity in detecting distribution differences when compared to the other tests, indicating the potentially reduced statistical power in this real data example.
Notice that the test $T_{PB}$ is bootstrap-based and the tests $T_{RP}$ and $T_{SR}$ are permutation-based. This means that their p-values are obtained via bootstrapping or permuting numerous random samples, as described in Section 3.1 and Section 3.2. Thus, the p-values of these tests are random, i.e., they differ across different runs. However, the p-value of $T_{WS}$ remains fixed. To investigate this clearly, we performed 500 iterations of $T_{PB}$, $T_{RP}$, $T_{WS}$, and $T_{SR}$ on the case "NOR vs. UNI vs. SUS". The boxplots of the corresponding p-values of the four tests are presented in Figure 3. It is evident that the p-values obtained from $T_{WS}$ remain fixed, whereas those derived from $T_{PB}$, $T_{RP}$, and $T_{SR}$ exhibit variability. This contrast underscores the fact that p-values resulting from bootstrap-based or permutation-based tests indeed differ across instances.

6. Concluding Remarks

Testing whether multiple high-dimensional samples adhere to the same distribution is a common area of research interest. This paper introduces and investigates a novel MMD-based test to address this question. The null distribution of this test is approximated using three methods: parametric bootstrap, random permutation, and the W–S χ 2 -approximation approach. Results from two simulation studies and a real data application demonstrate that the proposed test exhibits effective size control and superior statistical power compared to the energy test introduced by [12] when the differences among sample distributions are primarily related to covariance matrices rather than mean vectors. Thus, the proposed test is generally well suited for conducting multi-sample equal distribution testing on high-dimensional data. We particularly recommend its use in scenarios where distribution differences are associated with covariance matrices. Conversely, when distribution differences predominantly pertain to means, the energy test is a more powerful choice. However, in practice, determining whether distribution differences are related to means or covariance matrices can be challenging. Therefore, we suggest considering both the new test and the energy test as viable options. Nevertheless, it is important to note that implementing the proposed test comes with certain challenges. Both the parametric bootstrap and random permutation methods can be computationally intensive, leading to variable p-values across different applications. In contrast, the W–S χ 2 -approximation method offers computational efficiency and produces fixed p-values. However, its accuracy is limited as it solely relies on matching two cumulants of the test statistic under the null hypothesis.
An intriguing question arises naturally: Can we enhance the accuracy of the proposed test by matching three cumulants of the test statistic? Recent work by [18] suggests that this is indeed possible. However, deriving the third cumulant of the test statistic presents a current challenge and requires further investigation. Another aspect to consider is the choice of kernel width. While the paper opts for simplicity by utilizing the median distance between observed vectors in the pooled sample, it is worth exploring the kernel width choice recommended by [18] to potentially enhance the test’s statistical power. These avenues for future research promise exciting developments and warrant further exploration.

Author Contributions

Conceptualization, J.-T.Z. and A.A.C.; methodology, J.-T.Z., T.Z. and Z.P.O.; software, Z.P.O. and T.Z.; validation, J.-T.Z. and T.Z.; formal analysis, Z.P.O.; investigation, J.-T.Z. and Z.P.O.; resources, Z.P.O.; data curation, Z.P.O.; writing—original draft preparation, J.-T.Z. and Z.P.O.; writing—review and editing, J.-T.Z. and T.Z.; visualization, A.A.C. and T.Z.; supervision, J.-T.Z.; project administration, J.-T.Z.; funding acquisition, J.-T.Z. All authors have read and agreed to the published version of the manuscript.

Funding

Zhang’s research was partially funded by the National University of Singapore academic research grant (22-5699-A0001).

Data Availability Statement

Publicly available datasets were analyzed in this study. This data can be found here: https://tandf.figshare.com/articles/dataset/Linear_hypothesis_testing_with_functional_data/6063026/1?file=10914914 (accessed on 20 October 2023).

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
MMD   Maximum Mean Discrepancy
EDF   Empirical Distribution Function
RKHS  Reproducing Kernel Hilbert Space
RBF   Radial Basis Function
ARE   Average Relative Error

Appendix A. Technical Proofs

Proof of (6).
We have
$$\begin{aligned}
T_n &= \sum_{\alpha=1}^{k} n_\alpha\|\bar{x}_\alpha - \bar{x}\|^2 = \sum_{\alpha=1}^{k} n_\alpha\left[\|\bar{x}_\alpha\|^2 - 2\langle \bar{x}_\alpha, \bar{x}\rangle + \|\bar{x}\|^2\right] = \sum_{\alpha=1}^{k} n_\alpha\|\bar{x}_\alpha\|^2 - n\|\bar{x}\|^2 \\
&= \sum_{\alpha=1}^{k} n_\alpha\|\bar{x}_\alpha\|^2 - \sum_{\alpha=1}^{k}\sum_{\beta=1}^{k}\frac{n_\alpha n_\beta}{n}\langle \bar{x}_\alpha, \bar{x}_\beta\rangle = \sum_{\alpha=1}^{k}\frac{n_\alpha(n - n_\alpha)}{n}\|\bar{x}_\alpha\|^2 - 2\sum_{1 \le \alpha < \beta \le k}\frac{n_\alpha n_\beta}{n}\langle \bar{x}_\alpha, \bar{x}_\beta\rangle \\
&= \sum_{\alpha \neq \beta}\frac{n_\alpha n_\beta}{n}\|\bar{x}_\alpha\|^2 - 2\sum_{1 \le \alpha < \beta \le k}\frac{n_\alpha n_\beta}{n}\langle \bar{x}_\alpha, \bar{x}_\beta\rangle = \sum_{1 \le \alpha < \beta \le k}\frac{n_\alpha n_\beta}{n}\left(\|\bar{x}_\alpha\|^2 + \|\bar{x}_\beta\|^2 - 2\langle \bar{x}_\alpha, \bar{x}_\beta\rangle\right) \\
&= \sum_{1 \le \alpha < \beta \le k}\frac{n_\alpha n_\beta}{n}\|\bar{x}_\alpha - \bar{x}_\beta\|^2. \qquad \square
\end{aligned}$$
Proof of Theorem 1.
Under Condition C1, we have $F_1 = \cdots = F_k = F$. Let $y, y' \overset{\text{i.i.d.}}{\sim} F$. Under Condition C3, Mercer's expansion (19) holds. Through (14) and (20), we have
$$\lambda_r\mathbb{E}_y[\psi_r(y)] = \int\mathbb{E}_y[\tilde{K}(y, y')]\psi_r(y')\,dF(y') = 0.$$
This, together with (20), implies that
$$\mathbb{E}[\psi_r(y)] = 0 \ \text{whenever} \ \lambda_r \neq 0, \quad \text{and} \quad \mathrm{Var}[\psi_r(y)] = \int\psi_r^2(y)\,dF(y) = 1.$$
Set $z_{r,\alpha i} = \psi_r(y_{\alpha i}), i = 1, \ldots, n_\alpha; \alpha = 1, \ldots, k$. Under Condition C1, we have $y_{\alpha i} \overset{\text{i.i.d.}}{\sim} F$.
It follows that for a fixed $r = 1, 2, \ldots$, the $z_{r,\alpha i}, i = 1, \ldots, n_\alpha; \alpha = 1, \ldots, k$, are i.i.d. with mean 0 and variance 1, and for different $r$, the $z_{r,\alpha i}, i = 1, \ldots, n_\alpha; \alpha = 1, \ldots, k$, are uncorrelated. Then, through (18) and (19), we have
$$\tilde{V}_{\alpha\alpha} = \frac{1}{n_\alpha^2}\sum_{i=1}^{n_\alpha}\sum_{j=1}^{n_\alpha}\sum_{r=1}^{\infty}\lambda_r z_{r,\alpha i}z_{r,\alpha j} = \sum_{r=1}^{\infty}\lambda_r\frac{1}{n_\alpha^2}\sum_{i=1}^{n_\alpha}\sum_{j=1}^{n_\alpha}z_{r,\alpha i}z_{r,\alpha j} = \sum_{r=1}^{\infty}\lambda_r\bar{z}_{r,\alpha}^2, \quad \text{and} \quad \tilde{V}_{\alpha\beta} = \frac{1}{n_\alpha n_\beta}\sum_{i=1}^{n_\alpha}\sum_{j=1}^{n_\beta}\sum_{r=1}^{\infty}\lambda_r z_{r,\alpha i}z_{r,\beta j} = \sum_{r=1}^{\infty}\lambda_r\bar{z}_{r,\alpha}\bar{z}_{r,\beta},$$
where $\bar{z}_{r,\alpha} = n_\alpha^{-1}\sum_{i=1}^{n_\alpha}z_{r,\alpha i}, \alpha = 1, \ldots, k; r = 1, 2, \ldots$. Through (17), we have
$$\tilde{T}_n = \sum_{r=1}^{\infty}\lambda_r\left[\sum_{\alpha=1}^{k}\frac{n_\alpha(n - n_\alpha)}{n}\bar{z}_{r,\alpha}^2 - 2\sum_{1 \le \alpha < \beta \le k}\frac{n_\alpha n_\beta}{n}\bar{z}_{r,\alpha}\bar{z}_{r,\beta}\right] = \sum_{r=1}^{\infty}\lambda_r\sum_{\alpha=1}^{k}n_\alpha(\bar{z}_{r,\alpha} - \bar{z}_r)^2 = \sum_{r=1}^{\infty}\lambda_r A_{n,r},$$
where $A_{n,r} = \sum_{\alpha=1}^{k}n_\alpha(\bar{z}_{r,\alpha} - \bar{z}_r)^2, r = 1, 2, \ldots$, and $\bar{z}_r = n^{-1}\sum_{\alpha=1}^{k}n_\alpha\bar{z}_{r,\alpha}$.
Let $\varphi_X(t) = \mathbb{E}(e^{\mathrm{i}tX})$ denote the characteristic function of a random variable $X$. Set $\tilde{T}_n^{(q)} = \sum_{r=1}^{q}\lambda_r A_{n,r}$. Then, we have $|\varphi_{\tilde{T}_n}(t) - \varphi_{\tilde{T}_n^{(q)}}(t)| \le |t|\left[\mathbb{E}(\tilde{T}_n - \tilde{T}_n^{(q)})^2\right]^{1/2}$. For any given $r = 1, 2, \ldots$, through the central limit theorem, under Conditions C2 and C3, as $n \to \infty$, we have $\sqrt{n_\alpha}\,\bar{z}_{r,\alpha} \overset{\mathcal{L}}{\longrightarrow} N(0, 1), \alpha = 1, \ldots, k$. Set $w_{n,r} = [\sqrt{n_1}\,\bar{z}_{r,1}, \ldots, \sqrt{n_k}\,\bar{z}_{r,k}]'$. Then, as $n \to \infty$, we have $w_{n,r} \overset{\mathcal{L}}{\longrightarrow} w_r \sim N_k(\boldsymbol{0}, \boldsymbol{I}_k)$, where $w_r, r = 1, 2, \ldots$, are independently and identically distributed. Set $\delta_{n,k} = [\sqrt{n_1/n}, \ldots, \sqrt{n_k/n}]'$. Under Condition C2, we have $\delta_{n,k} \to \delta_k = [\sqrt{\tau_1}, \ldots, \sqrt{\tau_k}]'$. Thus, $\boldsymbol{G}_n = \boldsymbol{I}_k - \delta_{n,k}\delta_{n,k}' \to \boldsymbol{G} = \boldsymbol{I}_k - \delta_k\delta_k'$. Both $\boldsymbol{G}_n$ and $\boldsymbol{G}$ are idempotent matrices with rank $k - 1$. It follows that
$$A_{n,r} = w_{n,r}'\boldsymbol{G}_n w_{n,r} \overset{\mathcal{L}}{\longrightarrow} A_r = w_r'\boldsymbol{G}w_r \sim \chi^2_{k-1}, \tag{A2}$$
which is a chi-squared distribution with $k - 1$ degrees of freedom, and $A_r, r = 1, 2, \ldots$, are independent. It follows that as $n \to \infty$, we have $\mathbb{E}(A_{n,r}) = (k - 1)[1 + o(1)]$ and $\mathrm{Var}(A_{n,r}) = 2(k - 1)[1 + o(1)]$. Therefore, as $n \to \infty$, we have
$$\mathbb{E}(\tilde{T}_n - \tilde{T}_n^{(q)})^2 = \mathbb{E}\left(\sum_{r=q+1}^{\infty}\lambda_r A_{n,r}\right)^2 = \mathrm{Var}\left(\sum_{r=q+1}^{\infty}\lambda_r A_{n,r}\right) + \mathbb{E}^2\left(\sum_{r=q+1}^{\infty}\lambda_r A_{n,r}\right) \le \left[\sum_{r=q+1}^{\infty}\sqrt{\mathrm{Var}(\lambda_r A_{n,r})}\right]^2 + \left[\sum_{r=q+1}^{\infty}\mathbb{E}(\lambda_r A_{n,r})\right]^2 = (k^2 - 1)\left(\sum_{r=q+1}^{\infty}\lambda_r\right)^2[1 + o(1)].$$
It follows that
$$|\varphi_{\tilde{T}_n}(t) - \varphi_{\tilde{T}_n^{(q)}}(t)| \le |t|(k^2 - 1)^{1/2}\sum_{r=q+1}^{\infty}\lambda_r\,[1 + o(1)].$$
Let $t$ be fixed. Under Condition C3 and (21), as $q \to \infty$, we have $\sum_{r=q+1}^{\infty}\lambda_r \to 0$. Thus, for any given $\epsilon > 0$, there exist $N_1$ and $Q_1$, depending on $|t|$ and $\epsilon$, such that for $n > N_1$ and $q > Q_1$, we have
$$|\varphi_{\tilde{T}_n}(t) - \varphi_{\tilde{T}_n^{(q)}}(t)| \le \epsilon. \tag{A4}$$
For any fixed $q > Q_1$, through (A2), as $n \to \infty$, we have $\tilde{T}_n^{(q)} \overset{\mathcal{L}}{\longrightarrow} \tilde{T}^{(q)} \overset{d}{=} \sum_{r=1}^{q}\lambda_r A_r$, $A_r \overset{\text{i.i.d.}}{\sim} \chi^2_{k-1}$. Thus, there exists $N_2$, depending on $q$ and $\epsilon$, such that for $n > N_2$, we have
$$|\varphi_{\tilde{T}_n^{(q)}}(t) - \varphi_{\tilde{T}^{(q)}}(t)| \le \epsilon. \tag{A5}$$
Recall that $\tilde{T} = \sum_{r=1}^{\infty}\lambda_r A_r$, $A_r \overset{\text{i.i.d.}}{\sim} \chi^2_{k-1}$. Along the same lines as those for proving (A4), we can show that there exists $Q_2$, depending on $|t|$ and $\epsilon$, such that for $q > Q_2$, we have
$$|\varphi_{\tilde{T}^{(q)}}(t) - \varphi_{\tilde{T}}(t)| \le \epsilon. \tag{A6}$$
It follows from (A4)–(A6) that for any $n \ge \max(N_1, N_2)$ and $q \ge \max(Q_1, Q_2)$, we have
$$|\varphi_{\tilde{T}_n}(t) - \varphi_{\tilde{T}}(t)| \le |\varphi_{\tilde{T}_n}(t) - \varphi_{\tilde{T}_n^{(q)}}(t)| + |\varphi_{\tilde{T}_n^{(q)}}(t) - \varphi_{\tilde{T}^{(q)}}(t)| + |\varphi_{\tilde{T}^{(q)}}(t) - \varphi_{\tilde{T}}(t)| \le 3\epsilon.$$
The convergence in distribution of T ˜ n to T ˜ follows as we can let ϵ 0 . □
Proof of Theorem 2.
Under Condition C1, let $y, y' \overset{\text{i.i.d.}}{\sim} F$. Then, through (23), we have $\mathbb{E}(\tilde{V}_{\alpha\alpha}) = n_\alpha^{-1}\mathbb{E}[\tilde{K}(y, y)]$ and $\mathbb{E}(\tilde{V}_{\alpha\beta}) = 0$. Thus, we have
$$\mathbb{E}(\tilde{T}_n) = \sum_{\alpha=1}^{k}\frac{n_\alpha(n - n_\alpha)}{n}\cdot\frac{1}{n_\alpha}\mathbb{E}[\tilde{K}(y, y)] = (k - 1)\mathbb{E}[\tilde{K}(y, y)].$$
Through (23) again, we have
$$\begin{aligned}
\mathrm{Var}(\tilde{T}_n) &= \sum_{\alpha=1}^{k}\frac{n_\alpha^2(n - n_\alpha)^2}{n^2}\mathrm{Var}(\tilde{V}_{\alpha\alpha}) + 4\sum_{1 \le \alpha < \beta \le k}\frac{n_\alpha^2 n_\beta^2}{n^2}\mathrm{Var}(\tilde{V}_{\alpha\beta}) \\
&= \sum_{\alpha=1}^{k}\frac{n_\alpha^2(n - n_\alpha)^2}{n^2}\left[\frac{1}{n_\alpha^3}\mathrm{Var}[\tilde{K}(y, y)] + \frac{2(n_\alpha - 1)}{n_\alpha^3}\mathbb{E}[\tilde{K}^2(y, y')]\right] + 4\sum_{1 \le \alpha < \beta \le k}\frac{n_\alpha^2 n_\beta^2}{n^2}\cdot\frac{1}{n_\alpha n_\beta}\mathbb{E}[\tilde{K}^2(y, y')] \\
&= \sum_{\alpha=1}^{k}\frac{(n - n_\alpha)^2}{n^2 n_\alpha}\mathrm{Var}[\tilde{K}(y, y)] + 2\left[\sum_{\alpha=1}^{k}\frac{(n - n_\alpha)^2(n_\alpha - 1)}{n^2 n_\alpha} + 2\sum_{1 \le \alpha < \beta \le k}\frac{n_\alpha n_\beta}{n^2}\right]\mathbb{E}[\tilde{K}^2(y, y')] \\
&= \sum_{\alpha=1}^{k}\frac{(n - n_\alpha)^2}{n^2 n_\alpha}\mathrm{Var}[\tilde{K}(y, y)] + 2\left[(k - 1) - \sum_{\alpha=1}^{k}\frac{(n - n_\alpha)^2}{n^2 n_\alpha}\right]\mathbb{E}[\tilde{K}^2(y, y')]. \qquad \square
\end{aligned}$$
Proof of Theorem 3.
First, through (12), we have $|\tilde{K}(y, y)| \le 4B_K$ and $|\tilde{K}(y, y')| \le 4B_K$ for all $y, y' \in \mathbb{R}^p$. Thus, we have
$$\mathrm{Var}[\tilde{K}(y, y)] \le \mathbb{E}[\tilde{K}^2(y, y)] \le 16B_K^2, \quad \text{and} \quad \mathbb{E}[\tilde{K}^2(y, y')] \le 16B_K^2.$$
Then, through Theorem 2, we have
$$\begin{aligned}
\mathrm{Var}(\tilde{T}_n) &= \sum_{\alpha=1}^{k}\frac{(n - n_\alpha)^2}{n^2 n_\alpha}\mathrm{Var}[\tilde{K}(y_{\alpha 1}, y_{\alpha 1})] + 2\left[(k - 1) - \sum_{\alpha=1}^{k}\frac{(n - n_\alpha)^2}{n^2 n_\alpha}\right]\mathbb{E}[\tilde{K}^2(y_{\alpha 1}, y_{\beta 1})] \\
&\le 16B_K^2\sum_{\alpha=1}^{k}\frac{(n - n_\alpha)^2}{n^2 n_\alpha} + 2 \cdot 16B_K^2\left[(k - 1) - \sum_{\alpha=1}^{k}\frac{(n - n_\alpha)^2}{n^2 n_\alpha}\right] = 16B_K^2\left[2(k - 1) - \sum_{\alpha=1}^{k}\frac{(n - n_\alpha)^2}{n^2 n_\alpha}\right] \le 32B_K^2(k - 1).
\end{aligned}$$
Finally, since $\bar{x}_\alpha, \alpha = 1, \ldots, k$, are independent, we have
$$\mathrm{Var}(S_n) = n^{-(1 - 2\Delta)}\sum_{\alpha=1}^{k}n_\alpha^2\cdot\frac{\sigma_\alpha^2}{n_\alpha} = n^{2\Delta}\sum_{\alpha=1}^{k}\frac{n_\alpha}{n}\sigma_\alpha^2.$$
Furthermore, through the Cauchy–Schwarz inequality, we have
$$\sum_{\alpha=1}^{k}\frac{n_\alpha}{n}\sigma_\alpha^2 = \sum_{\alpha=1}^{k}\frac{n_\alpha}{n}\mathbb{E}\langle x_{\alpha 1} - \mu_\alpha, h_\alpha - \bar{h}\rangle^2 \le \sum_{\alpha=1}^{k}\frac{n_\alpha}{n}\mathbb{E}\|x_{\alpha 1} - \mu_\alpha\|^2\,\|h_\alpha - \bar{h}\|^2 = \sum_{\alpha=1}^{k}\frac{n_\alpha}{n}\|h_\alpha - \bar{h}\|^2\,\mathbb{E}[\tilde{K}(y_{\alpha 1}, y_{\alpha 1})] \le 4B_K\sum_{\alpha=1}^{k}\frac{n_\alpha}{n}\|h_\alpha - \bar{h}\|^2 \le 4B_K B_h < \infty. \qquad \square$$
Proof of Theorem 4.
Under the given conditions, through Theorems 2 and 3, we have
$$\mathbb{E}(\tilde{T}_n) = (k - 1)\mathbb{E}[\tilde{K}(y_{\alpha 1}, y_{\alpha 1})], \quad \text{and} \quad \mathrm{Var}(\tilde{T}_n) \le 32B_K^2(k - 1),$$
and as $n \to \infty$,
$$\mathrm{Var}(S_n) = n^{2\Delta}\left[\sum_{\alpha=1}^{k}\tau_\alpha\sigma_\alpha^2\right][1 + o(1)],$$
where $\tau_\alpha, \alpha = 1, \ldots, k$, are given in Condition C2. It follows that as $n \to \infty$, we have $\mathbb{E}(\tilde{T}_n)/\sqrt{\mathrm{Var}(S_n)} \to 0$ and $\mathrm{Var}(\tilde{T}_n)/\mathrm{Var}(S_n) \to 0$. Thus, through the Markov inequality, as $n \to \infty$, we have
$$\Pr\left(\frac{|\tilde{T}_n|}{\sqrt{\mathrm{Var}(S_n)}} > \varepsilon\right) = \Pr\left(\frac{\tilde{T}_n^2}{\mathrm{Var}(S_n)} > \varepsilon^2\right) \le \frac{1}{\varepsilon^2}\cdot\frac{\mathbb{E}(\tilde{T}_n^2)}{\mathrm{Var}(S_n)} \to 0,$$
for all $\varepsilon > 0$. Therefore, we have $\tilde{T}_n/\sqrt{\mathrm{Var}(S_n)} \overset{P}{\longrightarrow} 0$, and (a) is proved. To prove (b), notice that through the central limit theorem, as $n_\alpha \to \infty$, we have
$$u_\alpha = \sqrt{n_\alpha}\,\langle\bar{x}_\alpha - \mu_\alpha, h_\alpha - \bar{h}\rangle/\sigma_\alpha \overset{\mathcal{L}}{\longrightarrow} N(0, 1), \quad \alpha = 1, \ldots, k.$$
Since $\bar{x}_\alpha, \alpha = 1, \ldots, k$, are independent and $S_n = n^{\Delta}\sum_{\alpha=1}^{k}\sqrt{n_\alpha/n}\,\sigma_\alpha u_\alpha$, we have $S_n/\sqrt{\mathrm{Var}(S_n)} \overset{\mathcal{L}}{\longrightarrow} N(0, 1)$. To prove (c), notice that as $n \to \infty$, through (25), (a) and (b), we have
$$\frac{T_n - n^{2\Delta}\sum_{\alpha=1}^{k}\frac{n_\alpha}{n}\|h_\alpha - \bar{h}\|^2}{2\sqrt{\mathrm{Var}(S_n)}} = \frac{\tilde{T}_n}{2\sqrt{\mathrm{Var}(S_n)}} + \frac{S_n}{\sqrt{\mathrm{Var}(S_n)}} \overset{\mathcal{L}}{\longrightarrow} N(0, 1). \tag{A7}$$
Thus, as $n \to \infty$, through (A7), we have
$$\begin{aligned}
\Pr(T_n \ge C_\epsilon) &= \Pr\left(\frac{T_n - n^{2\Delta}\sum_{\alpha=1}^{k}\frac{n_\alpha}{n}\|h_\alpha - \bar{h}\|^2}{2\sqrt{\mathrm{Var}(S_n)}} \ge \frac{C_\epsilon - n^{2\Delta}\sum_{\alpha=1}^{k}\frac{n_\alpha}{n}\|h_\alpha - \bar{h}\|^2}{2\sqrt{\mathrm{Var}(S_n)}}\right) \\
&= 1 - \Phi\left(\frac{C_\epsilon - n^{2\Delta}\sum_{\alpha=1}^{k}\frac{n_\alpha}{n}\|h_\alpha - \bar{h}\|^2}{2n^{\Delta}\sqrt{\sum_{\alpha=1}^{k}\frac{n_\alpha}{n}\sigma_\alpha^2}}\right)[1 + o(1)] = \Phi\left(\frac{n^{\Delta}\sum_{\alpha=1}^{k}\tau_\alpha\|h_\alpha - h^*\|^2}{2\sqrt{\sum_{\alpha=1}^{k}\tau_\alpha\sigma_\alpha^{*2}}}[1 + o(1)]\right) \to 1,
\end{aligned}$$
where $h^* = \sum_{\alpha=1}^{k}\tau_\alpha h_\alpha$ and $\sigma_\alpha^{*2} = \mathbb{E}\langle x_{\alpha 1} - \mu_\alpha, h_\alpha - h^*\rangle^2$. Thus, the theorem is proved. □

References

  1. Lehmann, E.L. Nonparametrics: Statistical Methods Based on Ranks; Springer: New York, NY, USA, 2006.
  2. Friedman, J.H.; Rafsky, L.C. Multivariate Generalizations of the Wald–Wolfowitz and Smirnov Two-Sample Tests. Ann. Stat. 1979, 7, 697–717.
  3. Schilling, M.F. Multivariate Two-Sample Tests Based on Nearest Neighbors. J. Am. Stat. Assoc. 1986, 81, 799–806.
  4. Baringhaus, L.; Franz, C. On a new multivariate two-sample test. J. Multivar. Anal. 2004, 88, 190–206.
  5. Rosenbaum, P.R. An exact distribution-free test comparing two multivariate distributions based on adjacency. J. R. Stat. Soc. Ser. B 2005, 67, 515–530.
  6. Biswas, M.; Mukhopadhyay, M.; Ghosh, A.K. A distribution-free two-sample run test applicable to high-dimensional data. Biometrika 2014, 101, 913–926.
  7. Li, J. Asymptotic normality of interpoint distances for high-dimensional data with applications to the two-sample problem. Biometrika 2018, 105, 529–546.
  8. Chen, H.; Friedman, J.H. A New Graph-Based Two-Sample Test for Multivariate and Object Data. J. Am. Stat. Assoc. 2017, 112, 397–409.
  9. Hall, P.; Tajvidi, N. Permutation Tests for Equality of Distributions in High-Dimensional Settings. Biometrika 2002, 89, 359–374.
  10. Wei, S.; Lee, C.; Wichers, L.; Marron, J.S. Direction-Projection-Permutation for High-Dimensional Hypothesis Tests. J. Comput. Graph. Stat. 2016, 25, 549–569.
  11. Ghosh, A.K.; Biswas, M. Distribution-free high-dimensional two-sample tests based on discriminating hyperplanes. Test 2016, 25, 525–547.
  12. Székely, G.J.; Rizzo, M.L. Testing for equal distributions in high dimension. InterStat 2004, 5, 1249–1272.
  13. Gretton, A.; Borgwardt, K.M.; Rasch, M.J.; Schölkopf, B.; Smola, A. A Kernel Two-Sample Test. J. Mach. Learn. Res. 2012, 13, 723–773.
  14. Sejdinovic, D.; Sriperumbudur, B.; Gretton, A.; Fukumizu, K. Equivalence of distance-based and RKHS-based statistics in hypothesis testing. Ann. Stat. 2013, 41, 2263–2291.
  15. Zhang, J.T.; Smaga, Ł. Two-sample test for equal distributions in separable metric space: New maximum mean discrepancy based approaches. Electron. J. Stat. 2022, 16, 4090–4132.
  16. Zhou, B.; Ong, Z.P.; Zhang, J.T. A new MMD-based two-sample test for equal distributions in separable metric spaces. Manuscript 2023, in press.
  17. Balogoun, A.S.K.; Nkiet, G.M.; Ogouyandjou, C. k-Sample problem based on generalized maximum mean discrepancy. arXiv 2018, arXiv:1811.09103.
  18. Zhang, J.T.; Guo, J.; Zhou, B. Testing equality of several distributions in separable metric spaces: A maximum mean discrepancy based approach. J. Econom. 2022, in press.
  19. Zhang, J.T.; Guo, J.; Zhou, B. Linear hypothesis testing in high-dimensional one-way MANOVA. J. Multivar. Anal. 2017, 155, 200–216.
  20. Gretton, A.; Fukumizu, K.; Harchaoui, Z.; Sriperumbudur, B.K. A Fast, Consistent Kernel Two-Sample Test. In Advances in Neural Information Processing Systems 22; Curran Associates, Inc.: New York, NY, USA, 2009; pp. 673–681.
  21. Welch, B.L. The generalization of 'Student's' problem when several different population variances are involved. Biometrika 1947, 34, 28–35.
  22. Satterthwaite, F.E. An Approximate Distribution of Estimates of Variance Components. Biom. Bull. 1946, 2, 110–114.
  23. Zhang, J.T. Two-Way MANOVA with Unequal Cell Sizes and Unequal Cell Covariance Matrices. Technometrics 2011, 53, 426–439.
  24. Smaga, Ł.; Zhang, J.T. Linear Hypothesis Testing with Functional Data. Technometrics 2019, 61, 99–110.
Figure 1. Simulation 1. The empirical powers (in %) of $T_{PB}$, $T_{RP}$, $T_{WS}$, and $T_{SR}$ under different cases of $(p, n_1, n_2, n_3, \delta_1, \delta_2)$: 1. (10, 20, 30, 40, 1.2, 0.6), 2. (10, 80, 120, 160, 0.75, 0.375), 3. (10, 160, 240, 320, 0.65, 0.325), 4. (100, 20, 30, 40, 1.3, 0.65), 5. (100, 80, 120, 160, 0.85, 0.425), 6. (100, 160, 240, 320, 0.72, 0.36), 7. (500, 20, 30, 40, 1.65, 0.825), 8. (500, 80, 120, 160, 1, 0.5), 9. (500, 160, 240, 320, 0.8, 0.4).
Figure 2. Simulation 2. The empirical powers (in %) of $T_{PB}$, $T_{RP}$, $T_{WS}$, and $T_{SR}$ under different cases of $(p, n_1, n_2, n_3, \delta_1, \delta_2)$: 1. (10, 20, 30, 40, 0.37, 0.185), 2. (10, 80, 120, 160, 0.195, 0.097), 3. (10, 160, 240, 320, 0.18, 0.09), 4. (100, 20, 30, 40, 0.142, 0.071), 5. (100, 80, 120, 160, 0.068, 0.034), 6. (100, 160, 240, 320, 0.044, 0.022), 7. (500, 20, 30, 40, 0.06, 0.03), 8. (500, 80, 120, 160, 0.028, 0.014), 9. (500, 160, 240, 320, 0.022, 0.011).
Figure 3. Boxplots of p-values of $T_{PB}$, $T_{RP}$, $T_{WS}$, and $T_{SR}$ when applied to the case "NOR vs. UNI vs. SUS" of the corneal surface data 500 times.
Table 1. Empirical sizes (in %) of Simulation 1.

                                    ρ = 0.1                   ρ = 0.5                   ρ = 0.9
Model  p    (n1, n2, n3)            T_PB  T_RP  T_WS  T_SR    T_PB  T_RP  T_WS  T_SR    T_PB  T_RP  T_WS  T_SR
1      10   (20, 30, 40)            4.0   4.5   5.2   4.9     4.4   4.0   4.7   4.1     5.7   5.5   6.1   5.9
            (80, 120, 160)          4.9   5.0   5.6   5.3     5.8   5.7   6.0   5.2     4.2   4.1   4.3   4.8
            (160, 240, 320)         5.1   5.1   5.2   4.9     4.0   4.2   4.4   3.9     4.6   5.4   4.9   4.3
       100  (20, 30, 40)            4.8   5.2   6.5   4.8     4.2   4.6   4.3   4.7     5.9   5.6   6.1   5.9
            (80, 120, 160)          4.6   4.4   4.9   3.9     4.8   5.1   5.3   5.3     5.1   5.3   5.3   5.5
            (160, 240, 320)         5.8   5.6   6.9   5.5     4.7   4.5   4.5   4.2     4.6   4.3   5.0   4.5
       500  (20, 30, 40)            5.0   5.4   5.8   4.9     4.3   4.4   4.7   4.8     3.8   4.4   4.5   4.5
            (80, 120, 160)          4.9   5.3   5.8   4.3     5.0   5.2   5.3   5.4     5.0   5.3   5.5   5.3
            (160, 240, 320)         6.2   6.2   7.0   5.7     6.4   6.2   6.4   6.4     4.8   4.4   5.0   5.2
2      10   (20, 30, 40)            4.9   5.5   6.6   5.7     4.4   4.5   5.0   4.9     5.6   5.0   5.7   5.6
            (80, 120, 160)          5.6   5.6   6.4   5.8     3.6   3.5   4.0   4.5     5.4   5.3   5.3   4.9
            (160, 240, 320)         5.7   5.8   6.8   6.3     4.9   5.5   5.8   5.2     5.3   5.5   5.7   5.1
       100  (20, 30, 40)            4.9   5.1   5.9   5.0     5.4   5.5   5.5   5.3     3.9   4.5   4.7   4.3
            (80, 120, 160)          5.5   5.6   6.6   5.5     4.5   4.3   4.9   3.8     5.8   5.6   5.7   5.6
            (160, 240, 320)         5.1   5.3   5.6   4.9     6.4   6.5   6.8   7.0     5.0   4.8   4.8   4.1
       500  (20, 30, 40)            5.0   5.1   5.4   5.5     5.9   5.8   6.0   6.6     5.0   5.2   5.3   5.5
            (80, 120, 160)          4.8   4.3   5.4   5.2     5.7   5.6   5.7   5.1     4.1   3.9   4.0   3.9
            (160, 240, 320)         4.2   4.3   5.0   4.9     4.6   4.4   4.8   4.4     5.4   5.7   5.4   5.9
3      10   (20, 30, 40)            4.2   5.0   5.6   4.8     5.4   6.0   6.5   5.7     4.9   4.9   5.5   4.8
            (80, 120, 160)          4.9   5.0   5.5   4.9     4.9   4.4   5.0   5.0     5.2   5.1   5.4   5.4
            (160, 240, 320)         4.5   4.6   5.5   4.9     5.0   5.0   5.4   5.0     6.0   5.9   5.8   5.8
       100  (20, 30, 40)            4.1   4.4   5.1   4.7     5.0   5.5   5.8   4.5     5.0   5.0   5.4   4.4
            (80, 120, 160)          3.1   3.5   4.0   4.5     4.8   5.2   5.0   4.4     5.0   5.0   5.0   5.0
            (160, 240, 320)         4.7   4.5   5.4   5.0     4.8   4.8   5.1   4.5     4.6   4.6   4.8   3.9
       500  (20, 30, 40)            4.9   4.7   5.4   6.0     5.0   4.6   5.4   4.5     4.1   4.1   4.5   3.9
            (80, 120, 160)          5.6   5.8   6.1   5.5     4.3   4.2   4.5   4.7     4.8   5.0   5.1   5.2
            (160, 240, 320)         5.0   4.8   5.5   6.0     4.4   4.5   4.6   4.5     5.5   5.4   5.6   6.0
ARE                                 9.04  9.33  16.22 8.67    10.67 12.52 11.56 11.70   9.26  8.74  9.19  11.56
Table 2. p-values (in %) for testing the distribution equality of corneal surface groups.

                                 Method
Case                             T_PB    T_RP    T_WS       T_SR
NOR vs. UNI vs. SUS vs. CLI      0       0       0.00004    0.09
NOR vs. UNI vs. SUS              31.2    32.3    33.3       38.9
NOR vs. UNI vs. CLI              0.02    0       0.0012     0.25
NOR vs. SUS vs. CLI              0       0       0.00001    0.01
UNI vs. SUS vs. CLI              0.01    0       0.0012     0.24