A Robust High-Dimensional Test for Two-Sample Comparisons

Hasan Bulut; Soofia Iftikhar; Nosheen Faiz; Olayan Albalawi

doi:10.3390/axioms13090585

¹ Department of Statistics, Ondokuz Mayıs University, Samsun 55200, Turkey

² Department of Statistics, Shaheed Benazir Bhutto Women University, Peshawar 25000, Pakistan

³ Department of Statistics, Abdul Wali Khan University, Mardan 23200, Pakistan

⁴ Department of Statistics, University of Tabuk, Tabuk 47713, Saudi Arabia

Axioms 2024, 13(9), 585; https://doi.org/10.3390/axioms13090585

This article belongs to the Special Issue Computational Statistics and Its Applications


Abstract

The Hotelling

T^{2}

statistic is used to compare the mean vectors of two independent multivariate Gaussian distributions. Nevertheless, this statistic is highly sensitive to outliers and is not suitable for high-dimensional datasets where the number of variables exceeds the sample size. This study introduces a robust permutation test based on the minimum regularized covariance determinant (MRCD) estimator to address these limitations of the two-sample Hotelling

T^{2}

statistic. Simulation studies were performed to evaluate the proposed test’s empirical size, power, and robustness. Additionally, the test was applied to both uncontaminated and contaminated Alzheimer’s Disease datasets. The findings from the simulations and real data examples indicate that the proposed test can be used effectively with high-dimensional data without being affected by outliers. Finally, an R function in the “MVTests” package was developed to apply the proposed test to real-world data.

Keywords:

robust Hotelling test statistic; high-dimensional data; minimum regularized covariance estimators; permutation test; MVTests

MSC:

62G10; 62G35; 62H15; 62P10

1. Introduction

Let

X_{1} ~ N_{p} (μ_{1}, Σ_{1})

and

X_{2} ~ N_{p} (μ_{2}, Σ_{2})

be random vectors from two independent groups. We test the hypothesis given in (1) to determine whether the mean vectors of these groups, which have multivariate Gaussian distributions, are equal.

\begin{matrix} H_{0} : μ_{1} = μ_{2} \\ H_{1} : μ_{1} \neq μ_{2} \end{matrix}

(1)

If the scatter parameters of these multivariate distributions are homogeneous

(Σ_{1} = Σ_{2} = Σ)

and the common covariance matrix is unknown, the two-sample Hotelling

T^{2}

statistic given in Equation (2) is used to test the hypothesis defined in (1):

T_{H o t e l l i n g}^{2} = (\frac{n_{1} n_{2}}{n_{1} + n_{2}}) {({\bar{X}}_{1} - {\bar{X}}_{2})}^{T} S^{- 1} ({\bar{X}}_{1} - {\bar{X}}_{2}) ~ T_{n_{1} + n_{2} - 2}^{2}

(2)

where

n_{1}

and

n_{2}

are sample sizes of the first and second groups; moreover

{\bar{X}}_{1}

and

{\bar{X}}_{2}

are the sample mean vectors of the first and second groups, respectively. Also

{(.)}^{T}

is the transpose operator.

S

is the pooled covariance matrix and is calculated as below:

S = \frac{1}{n_{1} + n_{2} - 2} ((n_{1} - 1) S_{1} + (n_{2} - 1) S_{2})

(3)

where

S_{1}

is the sample covariance matrix of the first group and

S_{2}

is the sample covariance matrix of the second group. Let

T_{C}^{2}

be the calculated value of the

T_{H o t e l l i n g}^{2}

statistic given in Equation (2). When

T_{C}^{2} > T_{n_{1} + n_{2} - 2; α}^{2}

, we reject the null hypothesis [1,2].
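As an illustration of Equations (2) and (3), the following base R sketch computes the classical two-sample statistic for low-dimensional data. The function name `hotelling_t2` and the simulated data are hypothetical, and the conversion to an F statistic is the standard textbook result rather than something stated in the text above.

```r
# Classical two-sample Hotelling T^2 (Equations (2) and (3)).
# X1, X2: numeric matrices, observations in rows, variables in columns.
hotelling_t2 <- function(X1, X2) {
  n1 <- nrow(X1)
  n2 <- nrow(X2)
  p  <- ncol(X1)
  d  <- colMeans(X1) - colMeans(X2)                   # difference of sample mean vectors
  S  <- ((n1 - 1) * cov(X1) + (n2 - 1) * cov(X2)) /
        (n1 + n2 - 2)                                 # pooled covariance, Equation (3)
  t2 <- (n1 * n2 / (n1 + n2)) * sum(d * solve(S, d))  # Equation (2)
  # Standard conversion (assumption, not from the text): scaled T^2 follows F(p, n1 + n2 - p - 1)
  f  <- (n1 + n2 - p - 1) / (p * (n1 + n2 - 2)) * t2
  list(T2 = t2,
       p.value = pf(f, p, n1 + n2 - p - 1, lower.tail = FALSE))
}

set.seed(1)
X1 <- matrix(rnorm(30 * 3), nrow = 30)              # group 1: n1 = 30, p = 3
X2 <- matrix(rnorm(25 * 3, mean = 0.5), nrow = 25)  # group 2: n2 = 25, p = 3
res <- hotelling_t2(X1, X2)
res$T2
res$p.value
```

The call to `solve(S, d)` is the step that breaks down once the pooled covariance matrix becomes singular, which is the high-dimensional limitation discussed in this section.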

The Hotelling

T^{2}

statistic is based on the classical mean vectors and covariance matrices. For this reason, it is highly sensitive to outliers: even a few outlying observations can distort these estimates and thereby the test result. The common solution in the literature is to use robust estimators of the mean vector and covariance matrix instead. Willems et al. [3] used the minimum covariance determinant (MCD) estimators, Çetin and Aktaş [4] used the minimum volume ellipsoid (MVE) estimators, and Mokhtar et al. [5] used the M-estimators for a one-sample Hotelling

T^{2}

statistic instead of classical ones. Similarly, Todorov and Filzmoser [6] used the MCD estimators for one-way MANOVA.

Another important problem is that the

T_{H o t e l l i n g}^{2}

statistic given in Equation (2) can only be used for low-dimensional datasets, that is, datasets with

p < n_{1} + n_{2} - 2

. Otherwise, the matrix

S

is singular and the precision matrix

(S^{- 1})

cannot be obtained. Therefore, the

T_{H o t e l l i n g}^{2}

cannot be calculated. There are many test statistics in the literature to overcome this problem.

Bai and Saranadasa [7] proposed the statistic in Equation (4). Because this statistic does not involve a precision matrix, it can be used with high-dimensional data. It can be computed for real data via the apval_Bai1996() function in the R package “highmean” [8].

T_{B S}^{2} = \frac{(\frac{n_{1} n_{2}}{n_{1} + n_{2}}) {({\bar{X}}_{1} - {\bar{X}}_{2})}^{T} ({\bar{X}}_{1} - {\bar{X}}_{2}) - t r (S)}{\sqrt{\frac{2 n (n + 1)}{(n - 1) (n + 2)} \{t r (S^{2}) - \frac{{(t r (S))}^{2}}{n}\}}}

(4)

where

t r (.)

is a trace operator.

Srivastava and Du [9] suggested using a diagonal matrix instead of a covariance matrix. The diagonal elements of this diagonal matrix are variances in matrix

S

. Therefore, Srivastava and Du [9] proposed a test statistic that can be used in high-dimensional data by removing the covariance information. This statistic is given in Equation (5). The p-value for this statistic can be obtained by using the apval_Sri2008() function in the R package called “highmean” [8].

T_{S D}^{2} = \frac{(\frac{n_{1} n_{2}}{n_{1} + n_{2}}) {({\bar{X}}_{1} - {\bar{X}}_{2})}^{T} D_{S}^{- 1} ({\bar{X}}_{1} - {\bar{X}}_{2}) - \frac{n p}{n - 2}}{\sqrt{2 (t r (R^{2}) - p^{2} n^{- 1}) c_{p, n}}}

(5)

where

D_{S} = d i a g (s_{11}, s_{22}, \dots, s_{p p})

is a diagonal matrix, and

s_{i i}

is the

i_{t h}

diagonal element of matrix

S

. Moreover,

R = D_{S}^{- 1 / 2} S D_{S}^{- 1 / 2}

is the correlation matrix, and

c_{p, n} = 1 + \frac{t r (R^{2})}{\sqrt{p^{3}}}

.

Chen and Qin [10] suggested the test statistic given in Equation (6). The p-value for the

T_{C Q}^{2}

statistic can be obtained by using the apval_Chen2010() function in the R package called “highmean” [8].

T_{C Q}^{2} = \frac{\sum_{i \neq j}^{n_{1}} X_{1 i}^{T} X_{1 j}}{n_{1} (n_{1} - 1)} + \frac{\sum_{i \neq j}^{n_{2}} X_{2 i}^{T} X_{2 j}}{n_{2} (n_{2} - 1)} - 2 \frac{\sum_{i = 1}^{n_{1}} \sum_{j = 1}^{n_{2}} X_{1 i}^{T} X_{2 j}}{n_{1} n_{2}}

(6)

These test statistics lose power when the

μ_{1} - μ_{2}

vector contains many zero elements [11,12]. For this reason, Cai, Liu and Xia [11] proposed a supremum test statistic. The p-value for this supremum statistic can be calculated by using the apval_Cai2014() function in the R package called “highmean” [8].

T_{C L X}^{2} = (\frac{n_{1} n_{2}}{n_{1} + n_{2}}) \max_{1 \leq i \leq p} {({\bar{X}}_{1}^{(i)} - {\bar{X}}_{2}^{(i)})}^{2} / s_{i i}

(7)
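To make the contrast with Equation (2) concrete, here is a minimal base R sketch of the statistic as written in Equation (7). It only needs the pooled variances, so it remains computable when the pooled covariance matrix is singular. The function name `clx_stat` and the simulated data are hypothetical, and the sketch covers only the form shown in Equation (7), not the p-value calculation in “highmean”.

```r
# Supremum-type statistic in the form of Equation (7).
clx_stat <- function(X1, X2) {
  n1 <- nrow(X1)
  n2 <- nrow(X2)
  d  <- colMeans(X1) - colMeans(X2)
  # Pooled variances s_ii, i.e. the diagonal of S in Equation (3)
  s  <- ((n1 - 1) * apply(X1, 2, var) + (n2 - 1) * apply(X2, 2, var)) /
        (n1 + n2 - 2)
  (n1 * n2 / (n1 + n2)) * max(d^2 / s)
}

set.seed(10)
n1 <- 10; n2 <- 10; p <- 50
X1 <- matrix(rnorm(n1 * p), nrow = n1)
X2 <- matrix(rnorm(n2 * p), nrow = n2)

# The pooled covariance has rank at most n1 + n2 - 2 = 18, so it is singular for p = 50
S <- ((n1 - 1) * cov(X1) + (n2 - 1) * cov(X2)) / (n1 + n2 - 2)
rank_S <- qr(S)$rank
rank_S

# The supremum statistic is still well defined
stat <- clx_stat(X1, X2)
stat
```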

The test statistics (

T_{B S}^{2}, T_{S D}^{2}, T_{C Q}^{2}, T_{C L X}^{2})

given in Equations (4)–(7) can be used to test the hypothesis in (1) with high-dimensional data. However, these test statistics are still sensitive to outliers because they are based on classical estimators.

As mentioned before, Willems, Pison, Rousseeuw and Van Aelst [3] and Todorov and Filzmoser [6] used the MCD estimators to obtain robust tests. However, it is well known that the MCD estimators can only be used for low-dimensional data. To overcome this limitation of the MCD estimators, Boudt et al. [13] developed the minimum regularized covariance determinant (MRCD) estimators. Unlike the MCD estimators, the MRCD estimators can be computed robustly in high-dimensional data. Detailed information about MRCD estimators is given in Section 2. Bulut [14] used the MRCD estimators to propose a robust Hotelling

T^{2}

statistic for one sample.

The main purpose of this study is to propose a test that can be used in high-dimensional data and is not sensitive to outliers. To this end, we extend the approach developed by Bulut [14] to the two-sample case. Moreover, we make the test procedure available for real data through an R function in the “MVTests” package [15].

The remainder of the study is designed as follows. In Section 2, the MRCD estimators are introduced. Section 3 defines the proposed robust Hotelling

T^{2}

statistic and test procedure. In Section 4, a robust version of

T_{C L X}^{2}

proposed by Cai, Liu and Xia [11] is introduced. We perform a simulation study to compare the performance of our test procedure and alternative tests in Section 5. In Section 6, we perform tests on real example data to compare the performance of the tests. We introduce the R functions, which can be used to perform our test statistic in Section 7. Finally, we provide conclusions in Section 8.

2. Minimum Regularized Covariance Determinant (MRCD) Estimators

Willems, Pison, Rousseeuw and Van Aelst [3] and Todorov and Filzmoser [6] used the MCD estimators to obtain robust test statistics. However, these estimators cannot be used when

n < p

.

Boudt, Rousseeuw, Vanduffel and Verdonck [13] proposed the minimum regularized covariance determinant (MRCD) estimators of location and scatter, which are not affected by outliers and can be computed in high-dimensional data. The MRCD estimator retains the good breakdown point properties of the MCD estimator [13,14].

To calculate MRCD estimations, firstly, we standardize data by using the median and

Q_{n}

as the univariate location and scatter estimators [16]; then, we use the

T

target matrix. This matrix

T

is symmetric and positive definite. The regularized covariance matrix of any subset

H

, which is obtained from standardized

Z

data, is calculated as below:

K (H) = ρ T + (1 - ρ) c_{α} S_{Z} (H)

(8)

where

ρ

is the regularization parameter,

c_{α}

is the consistency factor defined by Croux and Haesbroeck [17], and

S_{Z} (H) = \frac{1}{h - 1} {(Z_{H} - μ_{Z} (H))}^{T} (Z_{H} - μ_{Z} (H)), μ_{Z} (H) = \frac{1}{h} Z_{H}^{T} 1_{h}

(9)

MRCD estimations are obtained from the subset

H_{M R C D}

which is obtained by solving the minimization problem given in Equation (10).

H_{M R C D} = \underset{H \in \mathcal{H}}{\operatorname{argmin}} [\det {(K (H))}^{1 / p}]

(10)

where

\mathcal{H}

is the set, which consists of all subsets with size

h

in data. Finally, the MRCD location and scatter estimators are obtained as given in Equations (11) and (12).

{\hat{μ}}_{M R C D} = V_{X} + D_{X} μ_{Z} (H_{M R C D})

(11)

{\hat{Σ}}_{M R C D} = D_{X} Q Λ^{1 / 2} [ρ I + (1 - ρ) c_{α} S_{w} (H_{M R C D})] Λ^{1 / 2} Q^{'} D_{X}

(12)

where

Λ

and

Q

are the eigenvalue and eigenvector matrices of

T

, respectively. Also,

S_{w} (H_{M R C D})

is calculated as below:

S_{w} (H_{M R C D}) = Λ^{- 1 / 2} Q^{'} S_{Z} (H_{M R C D}) Q Λ^{- 1 / 2}

(13)

More detailed information about MRCD estimators is available in [13]. In this study, we used the “rrcov” package in R for calculations regarding the MRCD estimators [18]. When using the package, we assume that the outlier rate of the data is known. Moreover, we use the default values of the regularization parameter (rho) and the target matrix, which the function determines automatically from the dataset.
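The role of the regularization in Equation (8) can be seen in a small base R sketch. It evaluates K(H) for one fixed subset H and shows that the result is invertible even when the subset covariance is singular. The function name `regularized_cov`, the identity target, ρ = 0.1 and c_α = 1 are illustrative assumptions; the sketch does not search over subsets or choose ρ from the data as the full estimator does.

```r
# Regularized subset covariance, Equation (8):
#   K(H) = rho * T + (1 - rho) * c_alpha * S_Z(H)
# Z: standardized data matrix, H: row indices of the subset.
regularized_cov <- function(Z, H, rho, target = diag(ncol(Z)), c_alpha = 1) {
  S_H <- cov(Z[H, , drop = FALSE])  # S_Z(H) with divisor h - 1, Equation (9)
  rho * target + (1 - rho) * c_alpha * S_H
}

set.seed(2)
n <- 20; p <- 40; h <- 15
Z <- matrix(rnorm(n * p), nrow = n)  # stand-in for standardized data
H <- sample(n, h)                    # one candidate subset of size h

S_H <- cov(Z[H, ])
K_H <- regularized_cov(Z, H, rho = 0.1)

rank_S_H <- qr(S_H)$rank             # at most h - 1, so S_Z(H) is singular for p = 40
min_eig  <- min(eigen(K_H, symmetric = TRUE, only.values = TRUE)$values)
log_obj  <- as.numeric(determinant(K_H, logarithm = TRUE)$modulus) / p  # log of det(K(H))^(1/p)

rank_S_H
min_eig    # bounded below by rho when the target is the identity
log_obj
```

With an identity target, every eigenvalue of K(H) is at least ρ, which is why the objective in Equation (10) stays finite for any subset. In practice the complete estimator is obtained from the “rrcov” package, as described above.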

3. Proposed Hotelling $T^{2}$ Test

The Hotelling

T^{2}

statistic, which is used to compare mean vectors of two independent groups, is given in Equation (2). As mentioned before, however, this statistic cannot be used in high-dimensional data and is sensitive to outliers in data. In this study, we propose a test statistic that can be used in high-dimensional data without being affected by outliers in data. This statistic is given in Equation (14).

T_{M R C D}^{2} = (\frac{n_{1} n_{2}}{n_{1} + n_{2}}) {(M_{1} - M_{2})}^{T} C^{- 1} (M_{1} - M_{2})

(14)

where

n_{1}

and

n_{2}

are the sample size of the first and second groups, respectively.

M_{1}

and

M_{2}

are MRCD estimations of location parameters of the first and second groups, respectively. The matrix C is a robust pooled covariance matrix and it is calculated as in Equation (15):

C = \frac{1}{n_{1} + n_{2} - 2} ((n_{1} - 1) S_{1}^{M R C D} + (n_{2} - 1) S_{2}^{M R C D})

(15)

where

S_{1}^{M R C D}

and

S_{2}^{M R C D}

are the MRCD covariance matrices of the first and second samples, respectively.

Because the finite-sample distribution of MRCD estimators is unknown, Bulut [14] used an approximate distribution for the one-sample case. This approach is also used by Willems, Pison, Rousseeuw and Van Aelst [3] and Todorov and Filzmoser [6] in low-dimensional data. However, we prefer a permutation test to an approximate distribution because the latter approach needs too much calculation time. As a result, we propose the following robust permutation test to compare the mean vectors of two independent groups in high-dimensional data:

(i) Calculate the MRCD estimations of the first and second groups.
(ii) Calculate the $T_{M R C D}^{2}$ value based on Equation (14). Let $T_{M R C D (c)}^{2}$ be this calculated value.
(iii) Combine all observations into one sample of size $n = n_{1} + n_{2}$ .
(iv) Under the null hypothesis, randomly split the observations so that $n_{1}$ observations are in the first group and $n_{2}$ observations are in the second group.
(v) Calculate the $T_{M R C D (i)}^{2}$ $(i = 1, 2, \dots, N)$ value based on Equation (14) for the generated synthetic groups.
(vi) Repeat steps (iii–v) N times. Here, N is the permutation number.
(vii) Calculate the p-value as given in Equation (16):

$p\text{-value} = \frac{\# (T_{M R C D (i)}^{2} > T_{M R C D (c)}^{2})}{N}$

(16)

According to the test algorithm, we can calculate the p-value directly without any distribution assumption. Moreover, we can calculate this p-value in high-dimensional data. We reject the null hypothesis when the p-value is less than the significance level. This test procedure can be performed on any real example data with the R function introduced in Section 7.

As with other permutation tests, the results of the proposed test depend on the permutation number: as the permutation number increases, the results become more stable.
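The permutation procedure of this section can be sketched in base R as follows. This is not the RperT2() implementation: to keep the sketch free of package dependencies, a hypothetical stand-in estimator `ridge_est` (classical mean with a ridge-regularized covariance) replaces the MRCD estimates, and the names `t2_stat` and `perm_test` are also hypothetical. The p-value counts the permuted statistics that exceed the observed one, so that a large observed statistic gives a small p-value, in line with the rule of rejecting when the p-value is below the significance level.

```r
# Stand-in estimator (NOT MRCD): classical mean and a ridge-regularized covariance,
# used only so that the sketch runs in base R when p exceeds the group size.
ridge_est <- function(X, rho = 0.5) {
  list(center = colMeans(X),
       cov = rho * diag(ncol(X)) + (1 - rho) * cov(X))
}
# With rrcov installed, an MRCD-based estimator could be passed instead, e.g.:
# mrcd_est <- function(X) {
#   fit <- rrcov::CovMrcd(X, alpha = 0.75)
#   list(center = rrcov::getCenter(fit), cov = rrcov::getCov(fit))
# }

# Statistic with the structure of Equations (14) and (15).
t2_stat <- function(X1, X2, est) {
  n1 <- nrow(X1); n2 <- nrow(X2)
  e1 <- est(X1); e2 <- est(X2)
  d  <- e1$center - e2$center
  C  <- ((n1 - 1) * e1$cov + (n2 - 1) * e2$cov) / (n1 + n2 - 2)
  (n1 * n2 / (n1 + n2)) * sum(d * solve(C, d))
}

# Permutation test: pool the data, reshuffle group labels N times.
perm_test <- function(X1, X2, N = 100, est = ridge_est) {
  n1 <- nrow(X1); n2 <- nrow(X2)
  t_obs  <- t2_stat(X1, X2, est)             # observed statistic
  pooled <- rbind(X1, X2)                    # combined sample of size n1 + n2
  t_perm <- replicate(N, {
    idx <- sample(n1 + n2, n1)               # random split into groups of size n1 and n2
    t2_stat(pooled[idx, , drop = FALSE], pooled[-idx, , drop = FALSE], est)
  })
  list(T2 = t_obs, p.value = mean(t_perm > t_obs))
}

set.seed(3)
X1 <- matrix(rnorm(10 * 20), nrow = 10)            # n1 = 10, p = 20, mean 0
X2 <- matrix(rnorm(10 * 20, mean = 1), nrow = 10)  # n2 = 10, p = 20, mean 1
out <- perm_test(X1, X2, N = 200)
out
```

Because the estimator is passed in as a function, the same loop works with any location and scatter estimator that returns a `center` vector and an invertible `cov` matrix.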

4. Robust CLX Test

Wang et al. [19] proposed a robust alternative to the test procedure suggested by Cai, Liu and Xia [11]. For consistency of notation, we denote this test statistic by

T_{R C L X}^{2}

, which is defined in Equation (17):

T_{R C L X}^{2} = (\frac{n_{1} n_{2}}{n_{1} + n_{2}}) \max \{\frac{{({\bar{Z}}_{η_{1}, 1})}^{2}}{w_{11}}, \dots, \frac{{({\bar{Z}}_{η_{p}, p})}^{2}}{w_{p p}}\}

(17)

where

{\bar{Z}}_{η} = {({\bar{Z}}_{η_{1}, 1}, \dots, {\bar{Z}}_{η_{p}, p})}^{T} = Ω_{η} ({\bar{X}}_{η} - {\bar{Y}}_{η})

, and

Ω_{η} = {(w_{i j})}_{p \times p} = Σ_{η}^{- 1}

is the common precision matrix for trimmed mean vectors

\sqrt{n_{1}} {\bar{X}}_{η}

and

\sqrt{n_{2}} {\bar{Y}}_{η}

.

{\bar{X}}_{η}

and

{\bar{Y}}_{η}

are the trimmed mean vectors for the first and second groups, respectively.

Let

X_{1}, \dots, X_{n}

be a random sample, and let

X_{(i)}

be the

i_{t h}

order statistic of this sample. For the trimming level

η \in (0, 0.5)

, the trimmed mean of this sample is given by:

{\bar{X}}_{η} = \frac{1}{n - 2 n η} \sum_{i = n η + 1}^{n - n η} X_{(i)}

(18)

Therefore, the trimmed mean vector can be obtained by calculating all the trimmed means of the p-variables for each group. According to this, we can define the trimmed mean vectors for the first and second groups as

{\bar{X}}_{η} = {({\bar{X}}_{η_{1}, 1}, \dots, {\bar{X}}_{η_{p}, p})}^{T}

and

{\bar{Y}}_{η} = {({\bar{Y}}_{η_{1}, 1}, \dots, {\bar{Y}}_{η_{p}, p})}^{T}

, respectively.

As a result, we can reject the null hypothesis when

T_{R C L X}^{2} > 2 \ln (p) - \ln [\ln (p)] + q_{α}

. Here,

q_{α}

is the

(1 - α)

th quantile of the Gumbel distribution. More detailed information about the

T_{R C L X}^{2}

test statistic is available from Wang, Lin and Tang [19].
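The trimmed mean in Equation (18) is simple to reproduce. The base R sketch below uses the hypothetical helper names `trimmed_mean` and `trimmed_mean_vector` and assumes that nη is an integer, as the summation limits in Equation (18) require.

```r
# Trimmed mean of Equation (18): drop the n*eta smallest and n*eta largest values.
trimmed_mean <- function(x, eta) {
  n <- length(x)
  k <- floor(n * eta)               # number trimmed from each tail (n*eta assumed integer)
  mean(sort(x)[(k + 1):(n - k)])    # average of the n - 2k central order statistics
}

# Trimmed mean vector: the univariate trimmed mean applied to each of the p columns.
trimmed_mean_vector <- function(X, eta) apply(X, 2, trimmed_mean, eta = eta)

x <- c(2, 4, 5, 6, 7, 8, 9, 10, 11, 100)  # one gross outlier
mean(x)                # 16.2, pulled up by the outlier
trimmed_mean(x, 0.1)   # 7.5, mean of the middle eight values
```

For a single variable, base R's `mean(x, trim = eta)` gives the same value.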

5. Simulation Study

In this section, we design a simulation study to compare

T_{B S}^{2}, T_{S D}^{2}, T_{C Q}^{2}, T_{C L X}^{2}, T_{R C L X}^{2}

, and

T_{M R C D}^{2}

, our proposed test, with respect to empirical size, power and robustness. We test the null hypothesis

H_{0} : μ_{1} = μ_{2}

in all simulation designs, and we take

μ_{2} = 0

in all cases. Here,

0

is a zero vector.

We randomly sample observations from the multivariate Gaussian distributions

N_{p} (μ_{1}, Σ)

for the first group and

N_{p} (μ_{2}, Σ)

for the second group. The common covariance matrix of these distributions is generated based on three different models used by Cai, Liu and Xia [11] as below:

Model 1: $Σ = (σ_{i, j})$ $(i, j = 1, 2, \dots, p)$ , where

$σ_{i, j} = \begin{cases} 1, & i = j \\ 0.8, & 2 (k - 1) + 1 \leq i \neq j \leq 2 k, \text{ where } k = 1, \dots, [p / 2] \\ 0, & \text{otherwise} \end{cases}$
Model 2: $Σ = (σ_{i, j})$ $(i, j = 1, 2, \dots, p)$ , where $σ_{i, j} = {0.6}^{|i - j|}$ for $1 \leq i, j \leq p$ .
Model 3: $Σ = Ω^{- 1}$ , and $Ω = (w_{i, j})$ $(i, j = 1, 2, \dots, p)$ , where $w_{i, i} = 2$ $(i = 1, 2, \dots, p)$ , $w_{i, i + 1} = 0.8$ $(i = 1, 2, \dots, p - 1)$ , $w_{i, i + 2} = 0.4$ $(i = 1, 2, \dots, p - 2)$ , $w_{i, i + 3} = 0.4$ $(i = 1, 2, \dots, p - 3)$ , $w_{i, i + 4} = 0.2$ $(i = 1, 2, \dots, p - 4)$ , and $w_{i, j} = 0$ otherwise.
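The three covariance models can be generated with a few lines of base R. The function names `model1`, `model2`, `model3` and `rmvn` are hypothetical, and for Model 3 the sketch assumes that Ω is symmetric, so that the listed entries above the diagonal are mirrored below it.

```r
# Model 1: unit variances, correlation 0.8 within consecutive pairs (1,2), (3,4), ...
model1 <- function(p) {
  Sigma <- diag(p)
  for (k in seq_len(floor(p / 2))) {
    i <- 2 * k - 1
    j <- 2 * k
    Sigma[i, j] <- Sigma[j, i] <- 0.8
  }
  Sigma
}

# Model 2: sigma_ij = 0.6^|i - j|
model2 <- function(p) 0.6^abs(outer(1:p, 1:p, "-"))

# Model 3: banded precision matrix Omega, Sigma = Omega^{-1}
# (assumption: Omega is symmetric, so w_{i+k,i} = w_{i,i+k})
model3 <- function(p) {
  w <- c(2, 0.8, 0.4, 0.4, 0.2)           # w_{i,i}, w_{i,i+1}, ..., w_{i,i+4}
  lag <- abs(outer(1:p, 1:p, "-"))
  Omega <- matrix(0, p, p)
  Omega[lag <= 4] <- w[lag[lag <= 4] + 1]
  solve(Omega)
}

# Draw n observations from N_p(mu, Sigma) using the Cholesky factor of Sigma.
rmvn <- function(n, mu, Sigma) {
  p <- length(mu)
  Z <- matrix(rnorm(n * p), nrow = n)
  sweep(Z %*% chol(Sigma), 2, mu, "+")
}

set.seed(5)
p <- 25
X <- rmvn(n = 10, mu = rep(0, p), Sigma = model2(p))
dim(X)
```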

For all the simulation studies, we use sample size values

n_{1} = n_{2} = n = 10, 20, 30

, and variable numbers

p = 25, 50, 75

. Therefore, we study high-dimensional data in all cases.

5.1. Empirical Size (Type I Error Rate) of Tests

In this subsection, we perform a simulation study to compare the empirical size performance of

T_{M R C D}^{2}

,

T_{R C L X}^{2}

,

T_{C L X}^{2}, T_{B S}^{2}, T_{S D}^{2},

and

T_{C Q}^{2}

. We generate

m = 3000

random samples from the multivariate Gaussian distributions

N_{p} (μ_{1}, Σ)

and

N_{p} (μ_{2}, Σ)

, respectively. We set

μ_{1} = μ_{2} = 0

such that the null hypothesis is true in all cases. We obtain

Σ

based on Models 1–3. We test the null hypothesis

(H_{0} : μ_{1} = μ_{2})

with

T_{M R C D}^{2}

,

T_{R C L X}^{2}

,

T_{C L X}^{2}, T_{B S}^{2}, T_{S D}^{2},

and

T_{C Q}^{2}

for each sample, and we calculate the empirical size of each test as the proportion of samples in which the true null hypothesis is rejected. We reject the null hypothesis when the p-value is less than 0.05. The results are given in Table 1.

Table 1. Empirical sizes of test statistics.

Table 1 compares the empirical sizes of the test statistics. We can see that the empirical sizes of

T_{M R C D}^{2}

are reasonably close to the nominal level of 5%. As the sample size increases, the empirical sizes of

T_{C L X}^{2}

become close to the nominal value, while the empirical sizes of

T_{R C L X}^{2}

move away from the nominal value. We can see in Table 1 that the

T_{B S}^{2}, T_{S D}^{2},

and

T_{C Q}^{2}

statistics have similar empirical size values.

5.2. Power of Tests

We perform a simulation study to compare the powers of

T_{M R C D}^{2}

,

T_{R C L X}^{2}

,

T_{C L X}^{2}, T_{B S}^{2}, T_{S D}^{2},

and

T_{C Q}^{2}

in this subsection. We generate

m = 3000

random samples from the multivariate Gaussian distributions

N_{p} (μ_{1}, Σ)

and

N_{p} (μ_{2}, Σ)

, respectively. We set

μ_{1} \neq μ_{2}

such that the null hypothesis is false in all cases. In all cases,

μ_{2} = 0

. We define

μ_{1}

with two different approaches as below:

Fixed magnitude: $μ_{1} = (μ_{1 j})$ $(j = 1, 2, \dots, p)$ , where $μ_{1 j} = \pm \sqrt{2 \log (p) / n}$ with equal probability for $j = 1, \dots, m$ , and $μ_{1 j} = 0$ $(j = m + 1, \dots, p)$ .
Varied magnitude: $μ_{1} = (μ_{1 j})$ $(j = 1, 2, \dots, p)$ , where $μ_{1 j}$ is generated from $U n i f o r m (- \sqrt{\frac{4 \log (p)}{n}}, \sqrt{\frac{4 \log (p)}{n}})$ for $j = 1, \dots, m$ , and $μ_{1 j} = 0$ $(j = m + 1, \dots, p)$ .
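The two designs for the mean vector of the first group can be sketched in base R as below, with m denoting the number of nonzero coordinates (the study uses the ceilings of 0.05p and of the square root of p). The function name `make_mu1` and the chosen values of p and n are hypothetical.

```r
# Mean vector of the first group under the two alternatives.
# p: dimension, n: group sample size, m: number of nonzero coordinates.
make_mu1 <- function(p, n, m, type = c("fixed", "varied")) {
  type <- match.arg(type)
  mu <- numeric(p)                       # coordinates m + 1, ..., p stay at zero
  if (type == "fixed") {
    signs <- sample(c(-1, 1), m, replace = TRUE)   # + or - with equal probability
    mu[1:m] <- signs * sqrt(2 * log(p) / n)
  } else {
    b <- sqrt(4 * log(p) / n)
    mu[1:m] <- runif(m, min = -b, max = b)
  }
  mu
}

p <- 50; n <- 20
m_sparse <- ceiling(0.05 * p)   # 3
m_root   <- ceiling(sqrt(p))    # 8

set.seed(6)
mu_fixed  <- make_mu1(p, n, m_sparse, type = "fixed")
mu_varied <- make_mu1(p, n, m_root, type = "varied")
```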

Here, we use two different m values of

m = ⌈ 0.05 p ⌉

and

m = ⌈ \sqrt{p} ⌉

. Also,

⌈ x ⌉

denotes the smallest integer that is greater than or equal to

x

. We obtain

Σ

based on Models 1–3. We test the null hypothesis

(H_{0} : μ_{1} = μ_{2})

with

T_{M R C D}^{2}

,

T_{R C L X}^{2}

,

T_{C L X}^{2}, T_{B S}^{2}, T_{S D}^{2},

and

T_{C Q}^{2}

for each sample. Then, we calculate the power of each test as the proportion of samples in which the false null hypothesis is rejected. We reject the null hypothesis when the p-value is less than 0.05. The results are given in Table 2. According to Table 2, the power loss of the

T_{M R C D}^{2}

is acceptable because the datasets generated are not contaminated.

Table 2. Power of test statistics.

5.3. Robustness of Tests

In this subsection, we perform a simulation study to investigate the robustness performance of

T_{M R C D}^{2}

,

T_{R C L X}^{2}

,

T_{C L X}^{2}, T_{B S}^{2}, T_{S D}^{2},

and

T_{C Q}^{2}

. We generate 90% of the observations in each of the two groups from the multivariate Gaussian distribution

N_{p} (0, Σ)

. Moreover, we generate the remaining 10% of the observations in the first and second groups from the multivariate Gaussian distributions

N_{p} (μ_{1 . o u t}, Σ)

and

N_{p} (μ_{2 . o u t}, Σ)

, respectively. Here,

μ_{1 . o u t} = {[\sqrt{p}, 0, \dots, 0]}^{T}

and

μ_{2 . o u t} = {[- \sqrt{p}, 0, \dots, 0]}^{T}

. We obtain

Σ

based on Models 1–3. Therefore, we contaminate samples with a 10% outlier ratio.
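A minimal base R sketch of this contamination scheme is given below. The function name `contaminate_group` is hypothetical, the identity matrix stands in for the covariance matrices of Models 1–3, and placing the outliers in the last rows of each group is a choice made only for the illustration.

```r
# Generate one group of size n in which a share eps of the rows is shifted by mu_out.
contaminate_group <- function(n, Sigma, mu_out, eps = 0.10) {
  p <- ncol(Sigma)
  n_out <- round(eps * n)                              # number of outlying rows
  X <- matrix(rnorm(n * p), nrow = n) %*% chol(Sigma)  # rows drawn from N_p(0, Sigma)
  if (n_out > 0) {
    rows <- seq(n - n_out + 1, n)                      # last n_out rows become outliers
    X[rows, ] <- sweep(X[rows, , drop = FALSE], 2, mu_out, "+")
  }
  X
}

set.seed(4)
p <- 25; n <- 30
Sigma <- diag(p)                          # stand-in for Models 1-3
mu1_out <- c(sqrt(p), rep(0, p - 1))      # outlier mean, first group
mu2_out <- c(-sqrt(p), rep(0, p - 1))     # outlier mean, second group

X1 <- contaminate_group(n, Sigma, mu1_out)
X2 <- contaminate_group(n, Sigma, mu2_out)

# The clean parts share the same mean, but the outliers pull the
# first coordinates of the two sample means in opposite directions.
colMeans(X1)[1] - colMeans(X2)[1]
```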

We draw a plot to show the effect of outliers in Figure 1. We generate 27 random observations from the first population

N_{p} (0, Σ)

and 27 observations from the second population

N_{p} (0, Σ)

such that

p = 25

. Moreover, we generate three random observations for the first group from the population

N_{p} (μ_{1 . o u t}, Σ)

, and three observations for the second group from the population

N_{p} (μ_{2 . o u t}, Σ)

.

Σ

is generated based on Model-1 in each case in Figure 1. Also,

μ_{1 . o u t} = {[5, 0, \dots, 0]}^{T}

and

μ_{2 . o u t} = {[- 5, 0, \dots, 0]}^{T}

. We used the multidimensional scaling method to reduce the dimension from 25 to 2.

Figure 1. Scatter plots illustrating the outlier problem in high-dimensional data.

According to Figure 1, we can see that the outliers in the two groups move the equal mean vectors away from each other. For this reason, a non-robust test may reject the null hypothesis even though it is true for the uncontaminated data, while a robust test should fail to reject the null hypothesis, unaffected by the outliers. In this section, we compare the robustness performance of each test.

We test the null hypothesis

(H_{0} : μ_{1} = μ_{2})

with

T_{M R C D}^{2}

,

T_{R C L X}^{2}

,

T_{C L X}^{2}, T_{B S}^{2}, T_{S D}^{2},

and

T_{C Q}^{2}

for each sample, and we reject the null hypothesis when the p-value is less than 0.05. Before the datasets are contaminated, the null hypothesis is true. If a test statistic is robust to outliers in the data, it should reject the null hypothesis at a rate close to the nominal significance level (5%). For this reason, we measure robustness as the proportion of samples in which the true null hypothesis is rejected. The results are given in Table 3.

Table 3. Robustness of test statistics.

According to Table 3, it is abundantly clear that the

T_{M R C D}^{2}

statistic is robust to outliers, while other statistics are heavily sensitive to outliers. The

T_{R C L X}^{2}

statistic has similar performance with contaminated data and uncontaminated data but this performance is unacceptable because the empirical size values are lower than the nominal value of 5%.

6. Real Example

Alzheimer’s disease (AD) is a cognitive impairment disorder marked by memory loss and a decline in functional abilities beyond what is typical for a given age, making it the most prevalent cause of dementia in the elderly. Biologically, Alzheimer’s disease is linked to amyloid-β (Aβ) brain plaques and brain tangles associated with a form of the Tau protein [20].

While medical imaging can be useful in predicting the onset of the disease, there is also a growing interest in potential cost-effective fluid biomarkers that can be extracted from plasma or cerebrospinal fluid (CSF). Currently, there are several acknowledged non-imaging biomarkers, including protein levels of specific forms of the Aβ and Tau proteins and the Apolipoprotein E genotype. The Apolipoprotein E genotype has three primary variants: E2, E3, and E4, with the E4 allele most closely associated with AD [21,22].

In a clinical study conducted by Craig-Schapiro et al. [23] involving 333 patients, including those with mild but well-characterized cognitive impairment and healthy individuals, CSF samples were collected from all subjects. The study aimed to determine if individuals in the early stages of impairment could be distinguished from cognitively healthy individuals. They included in the analysis demographic attributes like age and gender, the Apolipoprotein E genotype, protein measurements of Aβ, Tau, and a phosphorylated version of Tau (referred to as pTau), protein measurements of 124 exploratory biomarkers, and clinical dementia scores [20]. This data is available with the name AlzheimerDisease in the “AppliedPredictiveModeling” package in R software [24].

In this subsection, we use only 124 protein measurements as biomarkers of AD. Therefore, we have

p = 124

variables in our data. Moreover, we use a subset that contains 106 high-risk patients with Apolipoprotein E genotype E3E4, as Wang, Lin and Tang [19] do. Of these,

n_{1} = 41

subjects are “impaired”, and

n_{2} = 65

subjects are “healthy”. Then, we standardize all variables. As a result, we have high-dimensional data, and we provide a correlation plot of this data in Figure 2.

Figure 2. Correlation plot matrix

(124 \times 124)

. Columns and rows were sorted with clustering methods.

We aim to compare the mean vectors of two groups (“impaired” vs. “healthy”). For this purpose, firstly, we assume that our data does not contain outliers, and so we test the null hypothesis

(μ_{i m p a i r e d} = μ_{h e a l t h y})

with

T_{M R C D}^{2}

,

T_{R C L X}^{2}

,

T_{C L X}^{2}, T_{B S}^{2}, T_{S D}^{2},

and

T_{C Q}^{2}

. The results of the tests are given in the “Clean Data” part of Table 4. According to this part, all the tests reject the null hypothesis.

Table 4. Test results of Alzheimer’s Disease data.

After analysis of the clean data, we contaminate about 10% of the data so that the difference between the mean vectors decreases. To this end, we modify the last four observations in the impaired group (group 1) and the last six observations in the healthy group (group 2) such that the mean values of all the variables in group 1 are equal to 2, and the mean values of all the variables in group 2 are equal to 1.8. Therefore, the difference between the mean vectors of the groups is 0.2 for all variables. Although the mean vectors of the uncontaminated groups differ, the introduced outliers reduce this difference. For this reason, the appropriate decision for any test is still to reject the null hypothesis, unaffected by the outliers we added to the groups.

We test these contaminated data again to compare the mean vectors with

T_{M R C D}^{2}

,

T_{R C L X}^{2}

,

T_{C L X}^{2}, T_{B S}^{2}, T_{S D}^{2},

and

T_{C Q}^{2}

. The results of the tests are given in the “Contaminated Data” part of Table 4. According to Table 4,

T_{M R C D}^{2}

still rejects the null hypothesis without being affected by outliers, whereas the other tests fail to reject the null hypothesis and are affected by outliers. These results show that

T_{M R C D}^{2}

is a robust test statistic to compare two mean vectors in high-dimensional datasets, while the other test statistics are not robust enough for this purpose.

7. Software Availability

We construct a function RperT2() in the R package “MVTests” [15] to perform the proposed robust permutation test on real datasets. In this function, the data matrix of the first group must be entered into X1, and the data matrix of the second group must be entered into X2. The alpha argument controls the size of the subsets over which the determinant is minimized and must take a value between 0.5 and 1 (default: alpha = 0.75). The N argument sets the number of permutations (default: N = 100). We give an example in Appendix A to demonstrate how to use this function to perform the proposed test on data in R.

8. Conclusions

In multivariate inference, the two-sample Hotelling

T^{2}

statistic is popular. However, this statistic cannot be used for contaminated or high-dimensional data.

There are many studies in the literature that propose Hotelling

T^{2}

statistics that are not sensitive to outliers. The common approach in these studies is to use robust location and scatter estimators instead of classical ones [4,5,6]. However, these statistics cannot be used in high-dimensional data.

On the other hand, there are also many studies that obtain

T^{2}

statistics, which can be used in high-dimensional data in the literature [9,10,11,12]. However, these statistics are sensitive to outliers in data.

Wang, Lin and Tang [19] proposed a robust test statistic that can be used in high-dimensional data. However, their statistic is useful for cell-wise contaminated data.

In this study, we propose a robust test statistic for high-dimensional data. This statistic is based on MRCD estimations. Because the finite-sample distribution of MRCD estimators is unknown, we propose using a permutation test without needing any asymptotic distribution.

We perform simulation studies to compare the empirical size, power and robustness performances of

T_{M R C D}^{2}

,

T_{R C L X}^{2}

,

T_{C L X}^{2}, T_{B S}^{2}, T_{S D}^{2},

and

T_{C Q}^{2}

in clean and row-wise contaminated data. According to the simulation studies, the empirical sizes of

T_{M R C D}^{2}

are close to the nominal level (5%), the powers of

T_{M R C D}^{2}

are acceptable, and

T_{M R C D}^{2}

is not sensitive to outliers in data.

We perform tests on clean and contaminated Alzheimer’s Disease data. In the clean case, all tests reject the null hypothesis. In the contaminated case, however, only

T_{M R C D}^{2}

rejects the null hypothesis, unlike other statistics.

Finally, we construct the RperT2() function to perform our proposed test on real data in the R package entitled MVTests [15]. We believe that our proposed statistic is a valuable contribution to multivariate inference for high-dimensional data.

In high-dimensional datasets, the missing values problem is another problem similar to the outlier one. Future studies will focus on how the proposed test statistic can cope with this problem in the case of missing observations in high-dimensional data.

Author Contributions

Conceptualization, H.B.; methodology, H.B.; software, H.B.; validation, S.I., N.F. and O.A.; formal analysis, H.B.; investigation, H.B. and N.F.; resources, S.I.; data curation, O.A.; writing—original draft preparation, H.B. and S.I.; writing—review and editing, H.B., N.F. and O.A.; visualization, H.B.; supervision, H.B.; funding acquisition, N.F. and O.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

We used synthetic datasets for simulations and Alzheimer’s Disease datasets for real examples in this study. The Alzheimer’s Disease data is available with the name AlzheimerDisease in the “AppliedPredictiveModeling” package in R software [24].

Acknowledgments

The authors thank the editors and four referees for their helpful, constructive comments, which have helped improve the quality and presentation of the paper.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

RperT2() function in the MVTests package [15] performs the proposed test on real high-dimensional data. The general usage of the function is as below:

RperT2 (X1, X2, alpha = 0.75, N = 100)

Here, X1 and X2 are the data matrices for the first and second groups, respectively. Each must be a matrix or a data.frame. alpha is a numeric parameter controlling the size of the subsets over which the determinant is minimized. The allowed values are between 0.5 and 1, and the default is 0.75. Finally, N is the permutation number, and its default value is 100.

As an example, we first generate random high-dimensional data in R as below:

> x <- mvtnorm::rmvnorm(n = 10, sigma = diag(20), mean = rep(0, 20))

> y <- mvtnorm::rmvnorm(n = 10, sigma = diag(20), mean = rep(1, 20))

We use the rmvnorm() function in the mvtnorm package to generate random data from the multivariate normal distribution. Both samples have 10 observations and 20 variables, so the number of variables exceeds the sample size and the data are high-dimensional. The mean vector of the first group is ${[0, 0, \dots, 0]}_{20 \times 1}$ and the mean vector of the second group is ${[1, 1, \dots, 1]}_{20 \times 1}$. The covariance matrices of both groups are equal to the identity matrix. As a result, we expect to find a statistically significant difference between the group means.
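As a quick check of this setup, the following minimal sketch regenerates data of the same shape with base R only. It assumes that, because both covariance matrices are the identity, independent normal draws are an adequate stand-in for rmvnorm(); the seed value and the object names x_chk and y_chk are illustrative and not part of the original example.

```r
# Illustrative sketch: with identity covariance the coordinates are
# independent, so rnorm() gives draws from the same distribution as rmvnorm().
set.seed(123)   # arbitrary seed (assumption), only for reproducibility
n <- 10         # observations per group
p <- 20         # number of variables

x_chk <- matrix(rnorm(n * p, mean = 0), nrow = n, ncol = p)  # group 1, mean vector 0
y_chk <- matrix(rnorm(n * p, mean = 1), nrow = n, ncol = p)  # group 2, mean vector 1

dim(x_chk)      # rows are observations, columns are variables
p > n           # TRUE: more variables than observations (high-dimensional)

# Sample mean vectors: entries scatter around 0 and 1, respectively
round(colMeans(x_chk), 2)
round(colMeans(y_chk), 2)
```

With only 10 observations per group, the individual sample means vary noticeably around their population values, which is one reason a formal test is needed.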

We can perform the proposed test on this data with default alpha and N values as below:

> install.packages("MVTests")

> library(MVTests)

> RperT2(X1 = x, X2 = y)

$T2
     TR2
419.7017

$p.value
[1] 0

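The function returns $T_{MRCD(c)}^{2}$ and the p-value. A reported p-value of 0 means that none of the N = 100 permuted statistics reached the observed value, so the p-value is below 1/N = 0.01, and we reject the null hypothesis at the 1% significance level.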

Finally, we can perform the proposed test on this data with different alpha and N values as below:

> install.packages("MVTests")

> library(MVTests)

> RperT2(X1 = x, X2 = y, alpha = 0.9, N = 300)

$T2
     TR2
378.2798

$p.value
[1] 0

The function again returns $T_{MRCD(c)}^{2}$ and the p-value. With N = 300, a reported p-value of 0 is below 1/N ≈ 0.0033, so we again reject the null hypothesis at the 1% significance level.
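To make the role of the permutation number N concrete, the following minimal sketch computes a two-sample permutation p-value in base R. It is not the implementation inside RperT2(): the statistic mean_dist(), the helper perm_pvalue(), and the seed are hypothetical choices made for illustration, with a squared distance between sample mean vectors standing in for the MRCD-based statistic.

```r
# Illustrative sketch of a two-sample permutation p-value.
# NOTE: mean_dist() is a simple stand-in statistic (an assumption), not the
# MRCD-based statistic computed by RperT2().
mean_dist <- function(a, b) sum((colMeans(a) - colMeans(b))^2)

perm_pvalue <- function(a, b, stat = mean_dist, N = 100) {
  observed <- stat(a, b)          # statistic on the original grouping
  pooled <- rbind(a, b)           # pool both samples
  n1 <- nrow(a)
  n_all <- nrow(pooled)
  perm_stats <- replicate(N, {
    idx <- sample(n_all, n1)      # random relabelling of the pooled rows
    stat(pooled[idx, , drop = FALSE], pooled[-idx, , drop = FALSE])
  })
  # Proportion of permuted statistics at least as large as the observed one
  mean(perm_stats >= observed)
}

set.seed(1)                       # arbitrary seed (assumption)
a <- matrix(rnorm(10 * 20, mean = 0), nrow = 10)
b <- matrix(rnorm(10 * 20, mean = 1), nrow = 10)
p_val <- perm_pvalue(a, b, N = 100)
p_val                             # always a multiple of 1/N = 0.01
```

Because the p-value is a proportion over N permutations, increasing N refines its resolution. A commonly used variant, (count + 1)/(N + 1), avoids reporting an exact zero.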

References

1. Rencher, A.C. Methods of Multivariate Analysis; John Wiley & Sons, Inc.: Montreal, QC, Canada, 2002.
2. Bulut, H. Multivariate Statistical Methods with R Applications, 2nd ed.; Nobel Academic Publishing: Ankara, Turkey, 2023.
3. Willems, G.; Pison, G.; Rousseeuw, P.J.; Van Aelst, S. A Robust Hotelling Test. Metrika 2002, 55, 125–138.
4. Cetin, M.; Altunay, S.A. Hotelling’s T2 Statistic Based on Minimum-Volume-Ellipsoid Estimator. Gazi Univ. J. Sci. 2003, 16, 691–695.
5. Mokhtar, M.A.A.M.; Yusoff, N.S.; Liang, C.Z. Robust Hotelling’s T2 Statistic Based on M-Estimator. J. Phys. Conf. Ser. 2021, 1988, 012116.
6. Todorov, V.; Filzmoser, P. Robust Statistic for the One-Way Manova. Comput. Stat. Data Anal. 2010, 54, 37–48.
7. Bai, Z.; Saranadasa, H. Effect of High Dimension: By an Example of a Two Sample Problem. Stat. Sin. 1996, 6, 311–329.
8. Lin, L.; Pan, W. highmean: Two-Sample Tests for High-Dimensional Mean Vectors; R Package Version 3.0; 2016. Available online: https://cran.r-project.org/web/packages/highmean/index.html (accessed on 23 August 2024).
9. Srivastava, M.S.; Du, M. A Test for the Mean Vector with Fewer Observations than the Dimension. J. Multivar. Anal. 2008, 99, 386–402.
10. Chen, S.X.; Qin, Y.-L. A Two-Sample Test for High-Dimensional Data with Applications to Gene-Set Testing. Ann. Stat. 2010, 38, 808–835.
11. Cai, T.T.; Liu, W.D.; Xia, Y. Two-Sample Test of High Dimensional Means under Dependence. J. R. Stat. Soc. Ser. B Stat. Methodol. 2014, 76, 349–372.
12. Xu, G.; Lin, L.; Wei, P.; Pan, W. An Adaptive Two-Sample Test for High-Dimensional Means. Biometrika 2016, 103, 609–624.
13. Boudt, K.; Rousseeuw, P.J.; Vanduffel, S.; Verdonck, T. The Minimum Regularized Covariance Determinant Estimator. Stat. Comput. 2020, 30, 113–128.
14. Bulut, H. A Robust Hotelling Test Statistic for One Sample Case in High Dimensional Data. Commun. Stat.-Theory Methods 2023, 52, 4590–4604.
15. Bulut, H. An R Package for Multivariate Hypothesis Tests: MVTests. Technol. Appl. Sci. 2019, 14, 132–138.
16. Rousseeuw, P.J.; Croux, C. Alternatives to the Median Absolute Deviation. J. Am. Stat. Assoc. 1993, 88, 1273–1283.
17. Croux, C.; Haesbroeck, G. Influence Function and Efficiency of the Minimum Covariance Determinant Scatter Matrix Estimator. J. Multivar. Anal. 1999, 71, 161–190.
18. Todorov, V.; Filzmoser, P. An Object-Oriented Framework for Robust Multivariate Analysis. J. Stat. Softw. 2010, 32, 1–47.
19. Wang, W.; Lin, N.; Tang, X. Robust Two-Sample Test of High-Dimensional Mean Vectors under Dependence. J. Multivar. Anal. 2019, 169, 312–329.
20. Kuhn, M.; Johnson, K. Applied Predictive Modeling; Springer: New York, NY, USA, 2013; Volume 26.
21. Kim, J.; Basak, J.M.; Holtzman, D.M. The Role of Apolipoprotein E in Alzheimer’s Disease. Neuron 2009, 63, 287–303.
22. Bu, G. Apolipoprotein E and Its Receptors in Alzheimer’s Disease: Pathways, Pathogenesis and Therapy. Nat. Rev. Neurosci. 2009, 10, 333–344.
23. Craig-Schapiro, R.; Kuhn, M.; Xiong, C.; Pickering, E.H.; Liu, J.; Misko, T.P.; Perrin, R.J.; Bales, K.R.; Soares, H.; Fagan, A.M.; et al. Multiplexed Immunoassay Panel Identifies Novel CSF Biomarkers for Alzheimer’s Disease Diagnosis and Prognosis. PLoS ONE 2011, 6, e18850.
24. Kuhn, M.; Johnson, K. AppliedPredictiveModeling: Functions and Data Sets for ‘Applied Predictive Modeling’; R Package Version 1. 2018. Available online: https://cran.r-project.org/web/packages/AppliedPredictiveModeling/index.html (accessed on 23 August 2024).

Figure 1. Scatter plots illustrating the outlier problem in high-dimensional data.

Figure 2. Correlation plot matrix ($124 \times 124$). Columns and rows were sorted with clustering methods.

Table 1. Empirical sizes of test statistics (%).

Models	n	p	$T_{M R C D}^{2}$	$T_{R C L X}^{2}$	$T_{C L X}^{2}$	$T_{B S}^{2}$	$T_{S D}^{2}$	$T_{C Q}^{2}$
Model-1	10	25	4.20	1.40	10.43	6.73	6.30	6.73
	20	50	4.40	0.90	7.33	6.03	5.30	6.03
	30	75	5.67	0.33	6.43	5.33	4.37	5.33
Model-2	10	25	4.90	4.20	10.90	6.63	6.60	6.63
	20	50	5.90	1.70	7.53	6.37	4.97	6.37
	30	75	4.00	1.33	6.53	6.63	5.17	6.63
Model-3	10	25	5.20	3.00	11.50	7.13	6.73	7.13
	20	50	5.60	4.40	8.26	6.63	5.60	6.63
	30	75	5.67	4.20	7.23	5.90	5.40	5.90

Table 2. Power of test statistics (%).

Models	n	p	Signal	m	$T_{M R C D}^{2}$	$T_{R C L X}^{2}$	$T_{C L X}^{2}$	$T_{B S}^{2}$	$T_{S D}^{2}$	$T_{C Q}^{2}$
Model-1	10	25	fixed	$⌈ 0.05 p ⌉$	18.20	18.60	29.00	19.33	18.17	19.33
			fixed	$⌈ \sqrt{p} ⌉$	50.00	34.30	48.10	49.17	44.30	49.17
			varied	$⌈ 0.05 p ⌉$	13.60	13.00	23.40	16.00	14.93	16.00
			varied	$⌈ \sqrt{p} ⌉$	31.60	20.40	42.43	33.47	31.13	33.47
	20	50	fixed	$⌈ 0.05 p ⌉$	25.40	37.20	30.73	24.27	20.83	24.27
			fixed	$⌈ \sqrt{p} ⌉$	80.40	75.40	57.20	69.63	63.93	69.63
			varied	$⌈ 0.05 p ⌉$	21.40	23.80	26.80	17.57	15.23	17.57
			varied	$⌈ \sqrt{p} ⌉$	53.20	52.60	49.80	45.43	40.10	45.43
	30	75	fixed	$⌈ 0.05 p ⌉$	41.40	64.60	34.40	29.63	24.80	29.63
			fixed	$⌈ \sqrt{p} ⌉$	83.80	89.80	59.20	71.67	67.03	71.67
			varied	$⌈ 0.05 p ⌉$	27.40	38.80	29.63	21.10	17.60	21.10
			varied	$⌈ \sqrt{p} ⌉$	54.60	62.40	50.40	46.90	41.57	46.90
Model-2	10	25	fixed	$⌈ 0.05 p ⌉$	17.70	15.40	28.20	18.07	16.03	18.07
			fixed	$⌈ \sqrt{p} ⌉$	51.50	34.20	49.70	43.77	39.30	43.77
			varied	$⌈ 0.05 p ⌉$	12.60	12.00	23.60	14.27	13.30	14.27
			varied	$⌈ \sqrt{p} ⌉$	34.60	30.60	43.60	27.40	24.83	27.40
	20	50	fixed	$⌈ 0.05 p ⌉$	29.40	27.40	30.60	19.80	16.40	19.80
			fixed	$⌈ \sqrt{p} ⌉$	83.00	56.60	58.00	61.47	54.33	61.47
			varied	$⌈ 0.05 p ⌉$	19.00	19.20	28.00	16.17	13.50	16.17
			varied	$⌈ \sqrt{p} ⌉$	56.80	42.80	51.60	40.07	34.73	40.07
	30	75	fixed	$⌈ 0.05 p ⌉$	40.00	41.80	36.57	24.77	20.80	24.77
			fixed	$⌈ \sqrt{p} ⌉$	84.80	73.00	58.00	64.57	58.07	64.57
			varied	$⌈ 0.05 p ⌉$	26.40	29.60	30.80	16.77	13.47	16.77
			varied	$⌈ \sqrt{p} ⌉$	57.00	57.00	52.40	40.10	33.90	40.10
Model-3	10	25	fixed	$⌈ 0.05 p ⌉$	25.90	46.20	44.40	31.83	31.83	31.83
			fixed	$⌈ \sqrt{p} ⌉$	58.40	47.30	69.07	77.20	74.60	77.20
			varied	$⌈ 0.05 p ⌉$	16.60	20.80	36.40	22.33	22.87	22.33
			varied	$⌈ \sqrt{p} ⌉$	41.40	38.80	56.80	52.13	50.97	52.13
	20	50	fixed	$⌈ 0.05 p ⌉$	36.80	40.60	53.60	41.53	41.00	41.53
			fixed	$⌈ \sqrt{p} ⌉$	83.00	72.40	81.80	94.23	92.60	94.23
			varied	$⌈ 0.05 p ⌉$	22.60	32.40	43.33	28.43	27.87	28.43
			varied	$⌈ \sqrt{p} ⌉$	56.20	56.60	69.20	70.90	68.90	70.90
	30	75	fixed	$⌈ 0.05 p ⌉$	41.40	55.00	62.60	51.80	49.60	51.80
			fixed	$⌈ \sqrt{p} ⌉$	86.80	82.20	84.43	95.33	94.23	95.33
			varied	$⌈ 0.05 p ⌉$	28.60	45.80	50.40	32.10	31.10	32.10
			varied	$⌈ \sqrt{p} ⌉$	59.00	68.80	73.43	73.33	71.03	73.33

Table 3. Robustness of test statistics (%).

Models	n	p	$T_{M R C D}^{2}$	$T_{R C L X}^{2}$	$T_{C L X}^{2}$	$T_{B S}^{2}$	$T_{S D}^{2}$	$T_{C Q}^{2}$
Model-1	10	25	5.60	1.00	10.47	9.73	7.57	9.73
	20	50	4.67	1.67	7.67	26.37	8.13	26.37
	30	75	6.33	0.00	6.33	59.47	8.40	59.47
Model-2	10	25	4.00	3.00	10.40	9.47	6.97	9.47
	20	50	5.67	1.00	8.33	22.37	7.70	22.37
	30	75	4.33	0.33	6.83	52.90	7.63	52.90
Model-3	10	25	5.20	2.20	12.40	9.93	7.57	9.93
	20	50	6.00	2.33	11.67	37.00	6.93	37.00
	30	75	6.33	4.33	7.07	84.13	8.57	84.13

Table 4. Test results of Alzheimer’s Disease data.

Test	Clean Data: Test Statistic	Clean Data: p-Value	Contaminated Data: Test Statistic	Contaminated Data: p-Value
$T_{M R C D}^{2}$	251.699	0.000	266.378	0.000
$T_{R C L X}^{2}$	17.839	0.038	0.280	0.927
$T_{C L X}^{2}$	26.923	0.000	7.702	0.492
$T_{B S}^{2}$	8.852	0.000	−0.612	0.730
$T_{S D}^{2}$	6.676	0.000	−0.262	0.603
$T_{C Q}^{2}$	8.864	0.000	−0.552	0.710

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
