1. Introduction
In recent years, data-driven decision-making has been routinely employed in various contexts, including evaluating the efficacy of medical treatments, vaccinations, training programs, marketing campaigns, and other interventions. The crux of effective decision-making lies in the ability to manage uncertainty and make informed choices based on available information. The concept of entropy, which quantifies uncertainty in a probability distribution, plays a crucial role in understanding how information is generated and utilized in these contexts.
Suppose an experimenter has collected data on n units, along with baseline covariate information. The question arises: how should treatments be allocated based on these pre-treatment covariates to provide precise causal estimates of treatment effects on potential outcomes? Here, the balance of information—essentially the reduction of uncertainty—across treatment groups becomes vital for drawing valid conclusions.
Randomized experiments, including completely randomized experiments and re-randomized experiments [1], are commonly employed as control methods. However, ref. [2] argued that a deterministic experiment can strictly dominate all randomized experiments in minimizing estimation error when continuous covariates are present. The ability of a deterministic design to minimize covariate imbalance reflects its potential to reduce uncertainty in treatment effect estimates, thereby leading to more reliable causal inferences. This deterministic experiment seeks to balance the covariate distribution across treatments, resulting in fairer experiments relative to random assignments. Consequently, deterministic experiments often appear more appealing in practice.
Ref. [2] developed a Bayesian optimal deterministic experiment (BODE) by minimizing the expected mean squared error (MSE) risk of the estimated treatment effect across assumed stochastic processes for the potential outcomes. However, identifying a BODE requires specifying approximate prior distributions for the potential outcomes, which introduces additional uncertainty in the modeling process.
In this paper, we introduce a novel type of optimal deterministic experiment, termed the minimax optimal deterministic experiment (MODE). The design criterion to be minimized is the maximum MSE risk associated with treatment assignment over a class of mean and variance functions of potential outcomes. This model-independent approach inherently reduces uncertainty by avoiding reliance on specific parametric assumptions. The bias component of the maximum risk is proportional to the generalized discrepancy, which quantifies covariate imbalance by measuring the difference between the empirical distributions of covariates in treatment and control groups. This generalized discrepancy serves as a critical measure of information loss, providing insight into the uncertainty surrounding treatment assignments.
Quasi-Monte Carlo methods are employed to generate the MODE. From a covariate balance perspective, the covariates in a randomized experiment can be viewed as a Monte Carlo sample from the covariate population. As a result, existing randomized experiments are sub-optimal, and the proposed MODE exhibits reduced covariate imbalance and lower estimation MSE. In particular, when potential outcomes are fixed, the MODE improves the convergence rate of the MSE of the difference-in-means estimator relative to the completely randomized experiment (CRE). Thus, for most online control experiments prioritizing estimation accuracy and the effective management of uncertainty, the proposed MODE is preferred. Additionally, we provide a re-randomization relaxation of the MODE for situations where randomization is necessary due to practical constraints.
The remainder of this paper is organized as follows. In Section 1.1, we review related experimental designs for causal inference. Our main contributions are summarized in Section 1.2. In Section 2, we formulate the problem of finding a MODE. Section 3 derives the large sample properties of the MODE. Section 4 theoretically compares the MODE with existing randomized experiments. Section 5 compares the finite sample performance of the MODE with that of existing experiments across different potential outcomes models. Finally, we conclude this paper and discuss future work in Section 6. For clarity, all proofs are provided in Appendix B.
1.1. Related Works
Completely randomized experiments are considered the gold standard for estimating causal effects, as randomization balances all potential confounding variables, both observed and unobserved, on average, thereby providing reliable estimates of treatment effects. However, as highlighted by [3], in a particular randomized experiment, covariates between treatment groups may become significantly unbalanced, leading to increased estimation error and greater uncertainty about the treatment effects. As pointed out by [1], with 10 independent covariates, the probability of at least one covariate showing a significant difference at the 5% significance level between treatment and control groups is about 40%. Similarly, ref. [4] emphasized that achieving covariate balance (e.g., through matching) enhances robustness to varying parametric assumptions.
Numerous methods exist in the design stage of an experiment to address covariate imbalance, including blocked randomization [5] and re-randomization [1,6]. Blocking is a classical approach that is effective for balancing a limited number of discrete covariates. In contrast, re-randomization, as proposed by [1], offers a comprehensive framework for reducing covariate imbalance, thereby enhancing the reliability of causal estimates. In this method, the randomized experiment is repeated until the covariates between treatment groups meet a pre-specified balance criterion. Thus, re-randomization generalizes blocking to accommodate multiple covariates with various values. For an in-depth review of the design and analysis of randomized experiments, please refer to [7].
However, from the perspective of statistical decision-making, randomized experiments are dominated by deterministic experiments [2]. Recent works by [8,9] have introduced optimal deterministic experiments based on assumed parametric potential outcome models. The BODE proposed by [2] depends on the chosen prior for potential outcomes, which adds another layer of uncertainty. All of these optimal designs are sensitive to the specified potential outcome models. In contrast, the proposed MODE is model-independent, thereby mitigating the uncertainty associated with model specification.
The generalized discrepancy utilized in the MODE is analogous to the optimization objective of kernel optimal matching [10], which is employed for covariate balancing in the analysis stage of an experiment. However, the generalized discrepancy is applied during the design stage, while the optimization variables in kernel optimal matching are the weights for the estimator. Unlike post hoc methods, the MODE is implemented entirely in the design phase, ensuring it remains unaffected by outcome data. Further arguments for doing as much as possible in the design phase of an experiment can be found in [11].
1.2. Main Contributions
Firstly, we introduce a novel type of model-independent deterministic experiment, termed the MODE, which minimizes uncertainty in treatment effect estimation. Secondly, we provide a theoretical proof demonstrating that, when potential outcomes are fixed, the proposed MODE reduces the covariate imbalance and the estimation MSE of the completely randomized experiment and of the re-randomization experiment using the Mahalanobis distance, improving the convergence rate of the MSE. The third contribution of this paper is establishing the relationship between the maximum risk of causal estimators and the generalized discrepancy, emphasizing how reducing this discrepancy enhances the overall information quality in causal inference.
2. Problem Setups
Let denote the observable data for a population of n units, where is a binary treatment for the i-th unit, i.e., the i-th unit is treated if and only if . and represent the p-dimensional pre-treatment covariates and the real-valued outcome for the i-th unit, respectively. Let represent all the p-dimensional covariates. We define and as the numbers of treated and untreated units, respectively. The set includes all possible treatment assignments with treated units.
Under the finite population potential outcomes framework and the stable unit-treatment-value assumption [12], we have , where and are the potential outcomes for the i-th unit under the treatment and control, respectively. This paper considers the randomness arising from both treatment assignment and potential outcomes. For and , we assume that the conditional mean and covariance of potential outcomes given all covariates are as follows:
where and are unknown mean and variance functions, respectively, and is the indicator function. The conditions in (1) impose restrictions on the correlations between potential outcomes and covariates, and they hold if are mutually independent [10].
Typically, the target of interest is to effectively estimate the average treatment effect using the difference-in-means estimator. In this paper, we aim to design a deterministic treatment assignment with to improve the estimation precision of the difference-in-means estimator. For simplicity, let represent the vector of mean and variance functions. The conditional MSE of given can be decomposed as follows:
where the conditional variance and squared bias are given by the following:
From the above expression, the MSE risk depends on the treatment assignment , the set of covariates , and the vector of unknown functions . Therefore, to minimize the MSE, we can carefully design by fully utilizing the information contained in and . Different assumptions regarding lead to distinct optimal treatment assignments [8,9]. However, these assumptions are unverifiable since the potential outcomes are unobservable. In this paper, we adopt a minimax framework to identify the most robust treatment assignment with respect to a class of unknown mean and variance functions. We do not impose specific functional forms on these unknown functions; instead, we restrict them to belong to a function class , as rigorously defined in (7). The maximum risk of a treatment is quantified by the worst-case MSE for that treatment over all possible true mean and variance functions in , i.e.,
The most robust treatment assignment with respect to is the one for which the maximum risk achieves the minimum over all possible treatments in . More precisely, we define the MODE as follows.
Definition 1. A treatment is called a MODE if
Therefore, the task of obtaining a MODE can be viewed as a game between the experimenter, who chooses a treatment assignment from , and nature, which chooses the worst-case potential outcome structures from . Additionally, the MODE is model-independent, as the maximum risk defined in (6) accounts for the uncertainties of all possible potential outcomes models. If multiple treatments attain the minimum maximum risk, the MODE in Definition 1 is not unique, but all such treatments are equivalent in terms of minimizing the maximum risk. Therefore, we can choose any one of them to conduct the experiment.
3. Minimax Optimal Deterministic Experiments
In this section, we first derive two equivalent expressions for the maximum risk in (6). We then consider the MODE, as outlined in Definition 1, under both fixed and divergent sample sizes. Additionally, we provide an algorithm for generating the MODE.
3.1. Expressions for the Maximum Risk
For any with , let and denote the covariate information of the treated and untreated units, respectively. To ensure that the maximum risk in (6) is well defined, we assume that
where and are pre-specified parameters, and is a reproducing kernel Hilbert space, equipped with the inner product and the reproducing kernel . A crucial feature of this kernel is its reproducibility, meaning that for any and ; see [13,14].
After embedding the mean functions into the reproducing kernel Hilbert space , we provide an analytical expression for the maximum risk by bounding the bias term using a generalized Koksma–Hlawka inequality.
Theorem 1. For any with , the maximum risk is expressed as follows:
where and are constants specified in Δ, and is the squared generalized discrepancy between and , defined by the following:

Theorem 1 shows that the variance component of the maximum risk is minimized by any balanced experiment in , i.e., . The bias component of the maximum risk is determined by the sum of the generalized discrepancies of the treated and untreated covariates. The generalized discrepancy defined in (8) can alternatively be expressed as the squared distance between the empirical distribution functions of the point sets and , denoted as and , respectively. Specifically, we have
where is the distance metric induced by the inner product . A point set with a small discrepancy indicates that the empirical distributions of and are close in terms of the distance metric .
Different choices of the reproducing kernel Hilbert space correspond to different generalized discrepancies, resulting in varying expressions for the maximum risk. We provide two widely used reproducing kernel Hilbert spaces, as described in [15,16].

Example 1. Let , which is the set of all real-valued functions on that have square-integrable mixed first derivatives [15]. In this context, can represent various types of discrepancies, such as the generalized -discrepancy and the centered -discrepancy. For instance, the kernel function for the centered -discrepancy is .

Example 2. Let . For the explicit form of this space, refer to [16]. It has been established that the space contains the Sobolev space when p is even; otherwise, , where is the set of functions whose s-th order derivatives are square-integrable. In this context, is the energy distance between and , with the kernel function defined as .

According to the property of the generalized discrepancy defined in (8), we provide another expression for the maximum risk.
Corollary 1. The maximum risk expression in Theorem 1 is equivalent to the following:

Corollary 1 indicates that the bias component of the maximum risk can be represented as the squared generalized discrepancy between the treated covariates and the untreated covariates . Consequently, the closer the empirical distributions of and are, the smaller the maximum risk becomes. The generalized discrepancy defined in (8) can be equivalently expressed as a maximum mean discrepancy [17], given by the following:

Thus, serves as a measure of covariate imbalance that incorporates a range of mean differences rather than focusing solely on a specific mean difference, as the Mahalanobis distance defined in (14) does. A zero generalized discrepancy implies that the covariate distributions in the treatment and control groups are identical. This distributional consistency cannot be achieved through mean matching alone. Additionally, using the equivalent expression for the maximum risk in Corollary 1 enhances computational efficiency, as evaluating the maximum risk of a treatment requires calculating the value of the discrepancy function only once.
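To make the discrepancy in Corollary 1 concrete, the energy-distance form of the generalized discrepancy (Example 2) can be computed directly from pairwise Euclidean distances between the treated and untreated covariates. A minimal sketch in Python; the function name `energy_distance` and the sample data are our own illustration, not code from the paper:

```python
import numpy as np

def energy_distance(A, B):
    """Energy-distance form of the generalized discrepancy between two
    point sets A (n1 x p) and B (n2 x p): zero exactly when the two
    empirical distributions coincide, larger under covariate imbalance."""
    def mean_pdist(X, Y):
        # average Euclidean distance over all pairs (x, y)
        diffs = X[:, None, :] - Y[None, :, :]
        return np.sqrt((diffs ** 2).sum(-1)).mean()

    return 2 * mean_pdist(A, B) - mean_pdist(A, A) - mean_pdist(B, B)

rng = np.random.default_rng(0)
treated = rng.uniform(size=(50, 2))
control = rng.uniform(size=(50, 2))
shifted = control + 0.5  # a clearly imbalanced "control" group

print(energy_distance(treated, control))  # similar distributions: small value
print(energy_distance(treated, shifted))  # shifted distribution: larger value
```

Because the statistic depends only on the empirical distributions, it detects distributional mismatch (e.g., a location shift) that mean matching alone would miss.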
3.2. MODE and Asymptotic MODE
For any fixed sample size n and discrepancy function in (10), we identify the MODE in Definition 1 by solving the following:

where . The treatment corresponding to the above is a MODE. When n is small, we can obtain the MODE by enumerating all elements in . However, this solution depends on the pre-specified parameters and , and it becomes almost infeasible for large n values since the cardinality of grows exponentially with n. Consequently, we explore the concept of the asymptotic MODE as n approaches infinity. To make progress, we require that the considered discrepancy function belongs to the following set:
Notably, a Monte Carlo point set of size with will typically have a discrepancy of order . In contrast, a quasi-Monte Carlo point set, also referred to as representative points and defined as the minima of a discrepancy, possesses a discrepancy of order . This low-discrepancy property supports the theoretical assertion that quasi-Monte Carlo points outperform Monte Carlo points and is fulfilled by commonly used discrepancies. For example, ref. [16] demonstrated that the n-point support points converge at a rate of for any , measured by the energy distance presented in Example 2. Additionally, Theorem 1 of [18] shows that if the covariates are jointly independent, the n-point data-driven subsamples converge at a rate of for any , measured by the generalized -discrepancy presented in Example 1. Thus, all discrepancies in Examples 1 and 2 belong to .
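For intuition on the small-n case, where the MODE can be found by enumerating all elements of the balanced assignment set, the following sketch enumerates every balanced split of a toy dataset and selects the one minimizing the energy distance between treated and untreated covariates, in the spirit of Corollary 1. The helper names (`exact_mode`, `energy_distance`) and the toy data are ours, for illustration only:

```python
import itertools
import numpy as np

def energy_distance(A, B):
    # energy-distance discrepancy between two point sets (see Example 2)
    d = lambda X, Y: np.sqrt(((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)).mean()
    return 2 * d(A, B) - d(A, A) - d(B, B)

def exact_mode(X):
    """Brute-force MODE for small even n: enumerate all balanced
    assignments and return the treated-unit mask minimizing the
    discrepancy between treated and untreated covariates."""
    n = len(X)
    best, best_val = None, np.inf
    for treated_idx in itertools.combinations(range(n), n // 2):
        mask = np.zeros(n, dtype=bool)
        mask[list(treated_idx)] = True
        val = energy_distance(X[mask], X[~mask])
        if val < best_val:
            best, best_val = mask.astype(int), val
    return best, best_val

rng = np.random.default_rng(1)
X = rng.uniform(size=(10, 2))   # n = 10 units, p = 2 covariates
w, disc = exact_mode(X)         # 0/1 assignment vector and its discrepancy
```

Enumeration visits all C(n, n/2) balanced assignments, which is why the exact search becomes infeasible for large n and motivates the asymptotic MODE above.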
For any given discrepancy function in , we identify an asymptotic MODE as follows.

Theorem 2. For any discrepancy in (10), the balanced treatment assignment with and , is an asymptotic MODE, with its maximum risk given by , as .

Theorem 2 ensures that the variance component of the maximum risk of the asymptotic MODE is minimized, while its bias component becomes negligible compared to its variance component by minimizing the covariate imbalance, defined by the discrepancy between the treated and untreated covariates. Consequently, experimental units with non-homogeneous covariates perform comparably to those with homogeneous covariates when utilizing the asymptotic MODE.
3.3. Algorithm for Generating Asymptotic MODE
Based on Theorem 2, we can implement the asymptotic MODE in the following steps. Firstly, we search for points from that have the lowest discrepancy relative to the entire covariate dataset . This subset is denoted as , and the remaining covariates are denoted as , i.e., . Then, we can assign the treatment to the units with (or ) and the control to the units with (or ). The above processes are summarized in Algorithm 1.
Algorithm 1: MODE: minimax optimal deterministic experiment
Input: : the covariates of n experimental units; : a valid kernel function defined on ; M: the maximum allowable sampling times.
Output: The MODE with treatment units and control units .
1. If the energy-distance kernel is used, set to the points obtained by the twinning method [19].
2. Else, if the empirical F-discrepancy kernel is used, set to the points obtained by the data-driven subsampling method [18].
3. Else, for m = 1:M, randomly sample a candidate subset and set to the sampled subset with the minimum discrepancy.
4. Return the treatment group and the control group .
This algorithm systematically selects treatment and control groups to minimize covariate imbalance, thereby enhancing the robustness of treatment assignment. Any valid discrepancy on can be used as the input for Algorithm 1. The optimality of the MODE is guaranteed by Theorems 2 and 3, provided the selected discrepancy belongs to .
We tailor different optimization methods to various discrepancy functions. For continuous covariates, we recommend using the energy distance [16] or the empirical F-discrepancy [18]. When employing the energy distance as the discrepancy, characterized by the kernel , we recommend utilizing the twinning method proposed by [19] to identify . For the empirical F-discrepancy using the kernel , where , is the empirical distribution function of the j-th component of , and is any valid kernel on , we recommend the data-driven subsampling method proposed by [18] to find . For the details of the twinning and the data-driven subsampling methods in Algorithm 1, please refer to [19] and [18], respectively. For discrete covariates, the Lee discrepancy [20] is preferred. For other valid discrepancies, a Monte Carlo approximation method is applied. This involves randomly sampling M subsets from and selecting the one, , with the minimum discrepancy value. Ref. [2] demonstrated that after times sampling, the probability that the output is better than 99% of all possible candidates already exceeds 99%, and significantly larger values of M are also feasible in practice.
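The Monte Carlo approximation branch described above can be sketched as follows: draw M random balanced assignments and keep the one with the smallest discrepancy. We use the energy distance for concreteness; the function names and demo data are our own illustration:

```python
import numpy as np

def energy_distance(A, B):
    # energy-distance discrepancy between two point sets (see Example 2)
    d = lambda X, Y: np.sqrt(((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)).mean()
    return 2 * d(A, B) - d(A, A) - d(B, B)

def monte_carlo_mode(X, n1, M=1000, seed=0):
    """Monte Carlo branch of the MODE search: draw M random size-n1
    treatment groups and keep the split with the smallest discrepancy
    between treated and untreated covariates."""
    rng = np.random.default_rng(seed)
    n = len(X)
    best_w, best_val = None, np.inf
    for _ in range(M):
        idx = rng.choice(n, size=n1, replace=False)
        w = np.zeros(n, dtype=int)
        w[idx] = 1
        val = energy_distance(X[w == 1], X[w == 0])
        if val < best_val:
            best_w, best_val = w, val
    return best_w, best_val

X_demo = np.random.default_rng(2).normal(size=(40, 3))
w_best, disc_best = monte_carlo_mode(X_demo, n1=20, M=500)
```

The best-of-M split is guaranteed to be at least as balanced, in discrepancy terms, as a single random draw, which is the sense in which moderate values of M already yield near-optimal candidates.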
Example 3. This example demonstrates the usefulness of the MODE procedure using the red wine quality dataset from the UCI Machine Learning Repository. We evaluate the impact of treatments, such as a new storage method, on red wine quality. Two key physicochemical properties, "pH" and "alcohol," which may influence the final outcome, are treated as covariates. We assume that all these covariates of the red wines have been collected. To minimize the influence of covariates on the estimation of treatment effects, it is essential to ensure that the covariates of the treatment and control groups are as similar as possible.
For clarity, we focus on the first 100 samples in the dataset, treating them as experimental units, and scale their covariates into the range . Given that the covariates are continuous, we adopt the energy distance with the kernel function in Algorithm 1. Figure 1 illustrates the covariates of equally sized treatment and control groups divided by the CRE and MODE methods. Intuitively, the covariates of the treatment and control groups created by MODE appear more balanced than those by CRE, particularly near the point. Quantitatively, the Mahalanobis distance between the covariates of the treatment and control groups under MODE is significantly smaller than that observed with CRE. This demonstrates that MODE provides a more effective approach for achieving covariate balance, which is crucial for reducing bias in treatment effect estimation.

4. Theoretical Comparison with Randomized Experiments
In this section, we theoretically compare the proposed MODE with completely randomized experiments and re-randomized experiments. When the potential outcomes are fixed, the MODE improves the convergence rate of the MSE relative to the CRE and to the re-randomization using the Mahalanobis distance. Thus, the proposed MODE serves as a highly effective covariate balance technique.
4.1. Comparison with Completely Randomized Experiments
For simplicity, we denote the difference-in-means estimators based on the CRE and the MODE as and , respectively. Under the CRE, where is randomly sampled from with , serves as an unbiased estimator for , i.e., , with a conditional variance given by the following:

The expression in (11) indicates that this randomization variance under the CRE equals the average of the conditional MSEs across treatments in . In contrast, the MODE in (1) minimizes the maximum MSE over a class of mean and variance functions.
We define for , and , where for , and for . The above randomization variance can be expressed more precisely as follows (see Appendix A for details):

The term aligns with the variance expression when the potential outcomes are fixed [21], while the term quantifies the randomness stemming from the potential outcomes.
Following the percent reduction in variance proposed by [1], we define the percent reduction in MSE as the percentage by which the MODE reduces the randomization MSE of the difference-in-means estimator:

where is the asymptotic MODE provided in Theorem 2. It is important to note that this percent reduction in MSE depends on the covariates and the unknown mean and variance functions in . Under certain mild conditions, we provide a lower bound for this percent reduction in MSE.
Assumption 1. (i) ; (ii) , for .
The first condition is necessary for the existence of the difference-in-means estimator. The second condition ensures that certain moments of the covariates are bounded in probability, which is easily satisfied if the mean and variance functions are bounded.
Theorem 3. Under the mild conditions specified in Assumption 1, the percent reduction in MSE defined in (13) satisfies the following:

Theorem 3 establishes that the proposed MODE effectively reduces the MSE of the difference-in-means estimator compared to the CRE. This reduction becomes more pronounced when the term defined in (12) is small, or when the term defined in (12) is large. An intuitive explanation of Theorem 3 is that while the CRE balances covariates on average, the actual distributions of covariates in the treatment and control groups may remain unbalanced in a specific experiment, resulting in larger estimation variance. In contrast, the MODE carefully arranges the treated and untreated units to ensure that their covariate distributions align as closely as possible, thereby achieving a smaller MSE.
It is evident that under Assumption 1, . If the potential outcomes are fixed, i.e., and , Theorem 2 implies that . Thus, the MODE demonstrates superiority over the CRE.
Corollary 2. Under the mild conditions outlined in Assumption 1, if and , then the MODE reduces the MSE of the CRE from to , i.e., with .
Corollary 2 indicates that the convergence rate of the MSE is improved by employing the MODE compared to the CRE when the potential outcomes are fixed. This further establishes the superiority of the proposed MODE.
4.2. Comparison with Re-Randomized Experiments
In this subsection, we demonstrate that the proposed MODE outperforms the re-randomized experiment using the Mahalanobis distance [1], referred to as ReM, in terms of minimizing the MSE. Furthermore, the proposed MODE can also be identified as a specific case of a re-randomized experiment.

The re-randomization technique enhances the performance of a randomized experiment by excluding assignments that yield unbalanced covariate distributions prior to the experiment's initiation. Specifically, a re-randomization criterion is defined as a binary function such that a treatment assignment belongs to the restricted randomization set if and only if . The restricted set for ReM is defined as follows:

where a is a pre-specified constant, and is the Mahalanobis distance between the treatment and control groups defined by the following:

where , , and is the sample covariance matrix. The ReM is considered balanced when .
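For concreteness, the ReM acceptance rule can be sketched as below, using one common form of the Mahalanobis imbalance measure, M = (n1 n0 / n) (xbar1 - xbar0)' S^{-1} (xbar1 - xbar0); the helper names are ours, and this is an illustration of the mechanism rather than the paper's code. An assignment is redrawn until its imbalance falls below the threshold a:

```python
import numpy as np

def mahalanobis_imbalance(X, w):
    """Mahalanobis distance between treatment and control covariate means,
    M = (n1*n0/n) * (xbar1 - xbar0)' S^{-1} (xbar1 - xbar0),
    where S is the sample covariance matrix of all covariates."""
    X1, X0 = X[w == 1], X[w == 0]
    n1, n0 = len(X1), len(X0)
    diff = X1.mean(0) - X0.mean(0)
    S = np.cov(X.T)  # p x p sample covariance of the full covariate set
    return (n1 * n0 / (n1 + n0)) * diff @ np.linalg.solve(S, diff)

def rem_assignment(X, n1, a, seed=0, max_draws=10000):
    """Re-randomization: redraw a completely randomized assignment until
    the Mahalanobis imbalance falls below the pre-specified constant a."""
    rng = np.random.default_rng(seed)
    n = len(X)
    for _ in range(max_draws):
        w = np.zeros(n, dtype=int)
        w[rng.choice(n, size=n1, replace=False)] = 1
        if mahalanobis_imbalance(X, w) <= a:
            return w
    raise RuntimeError("no acceptable assignment found; increase a or max_draws")
```

Under the CRE, this M statistic is approximately chi-square with p degrees of freedom, so a can be chosen as a chi-square quantile to control the acceptance probability.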
Consider the scenario where the potential outcomes are fixed and the causal effect is additive, i.e., and for . In this case, the balanced ReM reduces the variance of the CRE to (Theorem 3.2 of [1]), where , is the Chi-square distribution with p degrees of freedom, and represents the squared multiple correlation between and . Thus, the balanced ReM enjoys the same convergence rate as the CRE. We summarize this conclusion as the following corollary.
Corollary 3. If , for , and for , then the variance of the balanced ReM is of order .
Theorem 2 implies that if and . Thus, as a result of Corollary 3, the proposed MODE improves the convergence rate of the estimated treatment effect compared to the balanced ReM from to . This further reinforces the superiority of using the proposed MODE.
Next, we establish a connection between the MODE and re-randomization. In our setting, the restricted set for the MODE, , can be defined as follows:

where . The unknown minimum value of the discrepancy in this re-randomization criterion can be estimated via Monte Carlo sampling. Specifically, we can randomly sample from a feasible number of times and take the minimum discrepancy among those samples as its estimate.
To leverage the strengths of randomization-based inference, we observe that the restricted set typically comprises either singletons or sets of small cardinality. Therefore, we relax the re-randomization criterion to the following:

where is the empirical -quantile of the distribution of all discrepancy values over . It is evident that the above re-randomization, , degenerates into the MODE as and into the CRE as . Thus, the re-randomized relaxation in (15) with effectively combines the randomness of the CRE with the efficiency of the MODE, making it useful when the experimenter is constrained to use randomization due to practical considerations.
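The relaxed criterion can be sketched as follows: estimate the empirical alpha-quantile of the discrepancy over M random balanced draws, then randomize among the assignments falling below it. Function names are ours, and the energy distance stands in for a generic discrepancy:

```python
import numpy as np

def energy_distance(A, B):
    # energy-distance discrepancy between two point sets (see Example 2)
    d = lambda X, Y: np.sqrt(((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)).mean()
    return 2 * d(A, B) - d(A, A) - d(B, B)

def relaxed_rerandomization(X, n1, alpha=0.05, M=2000, seed=0):
    """Re-randomized relaxation of the MODE: draw M random balanced
    assignments, estimate the empirical alpha-quantile of their
    discrepancies, and randomize uniformly among those below it."""
    rng = np.random.default_rng(seed)
    n = len(X)
    draws, vals = [], []
    for _ in range(M):
        w = np.zeros(n, dtype=int)
        w[rng.choice(n, size=n1, replace=False)] = 1
        draws.append(w)
        vals.append(energy_distance(X[w == 1], X[w == 0]))
    vals = np.asarray(vals)
    cutoff = np.quantile(vals, alpha)          # empirical alpha-quantile
    accepted = [w for w, v in zip(draws, vals) if v <= cutoff]
    return accepted[rng.integers(len(accepted))]
```

Taking alpha toward 0 recovers (an estimate of) the MODE, while alpha = 1 accepts every draw and recovers the CRE, mirroring the two limits discussed above.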
5. Simulations
In this section, we compare the performance of several experimental approaches, including the balanced CRE, the balanced ReM utilizing the critical value [1], the BODE with default priors as outlined by [2], and the proposed MODE employing the energy distance with the kernel function . Each experiment is repeated 1000 times across all settings. The simulation results indicate that the proposed MODE demonstrates reduced covariate imbalance and lower estimation uncertainty compared to existing methods.
5.1. Covariate Imbalance
We examine two sample size and dimensionality scenarios: (i) and ; (ii) and . In each case, the covariates are generated as and for . We then apply the four experimental designs to the covariate set . The Mahalanobis distance , as defined in (14), and the energy distance , which represents the discrepancy function in (8) with the kernel function , are employed as two measures of covariate imbalance.
Figure 2 illustrates the covariate imbalances across the various experimental methods. The CRE method balances covariates on average; however, many CRE instances exhibit substantial covariate imbalance. In contrast, the ReM method maintains both the Mahalanobis distance and energy distance of covariates between treatment and control groups within a smaller range. The BODE further enhances covariate balance through Monte Carlo optimization. Notably, the proposed MODE demonstrates the smallest covariate imbalance across different settings. These findings suggest that the experimental units in the treatment and control groups based on MODE are more homogeneous, resulting in more comparable experimental outcomes.
5.2. Estimation Precision
To evaluate the precision of difference-in-means estimators derived from various experiments, we assume that the responses are generated from the following potential outcomes model:

where if and only if the i-th unit is treated in the corresponding experiment. We consider four different specifications for mean and variance functions:
- C1. ;
- C2. ;
- C3. ;
- C4. ,
where . In C1 and C3, we assume that the potential outcomes are fixed.
Table 1 presents the empirical MSEs of difference-in-means estimators across the various experiments. As the sample sizes increase, the empirical MSEs decrease, as expected; however, the percentage reductions in MSE of the MODE compared to the CRE become more pronounced, indicating that the MSEs associated with the MODE converge more rapidly. Under each configuration of dimension and sample size, the ReM reduces the randomization variance of the CRE by constraining the Mahalanobis distances of the CREs. The BODE generally yields a smaller MSE than the ReM. Notably, the proposed MODE achieves the smallest MSE across various scenarios. Furthermore, the percentage reduction in MSE of the MODE over the CRE is particularly significant when the potential outcomes are fixed (C1 and C3), consistent with the optimality outlined in Corollary 2.
We also compare the differences between the true distribution of potential outcomes and the empirical distributions of the observed outcomes derived from various experiments. Table 2 presents the Kolmogorov–Smirnov values for the different experiments, where and represent the empirical distribution functions of the potential and observed outcomes from the treatment group, respectively.
As anticipated, all Kolmogorov–Smirnov values decrease, indicating that the observed outcomes converge to the true potential outcomes in distribution as the sample sizes increase. Across all settings, the MODE yields the lowest Kolmogorov–Smirnov values, suggesting that the observed outcomes are closer to the true potential outcomes in distribution. This property of matching distributions, which benefits from quasi-Monte Carlo techniques, also enhances the estimation of treatment effects beyond the average treatment effect.
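The Kolmogorov–Smirnov value used here is the sup-distance between two empirical distribution functions. A minimal sketch of computing it for two samples; the helper name `ks_statistic` is ours, not the paper's code:

```python
import numpy as np

def ks_statistic(x, y):
    """Two-sample Kolmogorov-Smirnov statistic: the supremum distance
    between the empirical distribution functions of x and y, evaluated
    on the pooled sample points."""
    grid = np.sort(np.concatenate([x, y]))
    Fx = np.searchsorted(np.sort(x), grid, side="right") / len(x)
    Fy = np.searchsorted(np.sort(y), grid, side="right") / len(y)
    return np.abs(Fx - Fy).max()

rng = np.random.default_rng(5)
potential = rng.normal(size=500)       # stand-in for potential outcomes
observed = rng.normal(size=250)        # stand-in for observed outcomes
print(ks_statistic(potential, observed))
```

A smaller value indicates that the observed outcomes track the distribution of the potential outcomes more closely, which is the comparison reported in Table 2.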
6. Conclusions and Discussion
This paper introduces a novel measure of covariate imbalance, termed the generalized discrepancy, and proposes a MODE designed to minimize this discrepancy. Both theoretical analysis and simulations demonstrate that the proposed MODE outperforms existing methods, such as the CRE and ReM, in terms of minimizing MSE. Thus, the MODE serves as a highly effective framework for controlled experimentation in the presence of covariates.
Importantly, the covariate imbalance, defined as the generalized discrepancy, quantifies the differences in covariate distributions between treatment and control groups. This measure can also be utilized in designing experiments aimed at improving the accuracy of estimating various causal effects, such as quantile treatment effects and average treatment effects on the treated group. Such applications enable researchers to evaluate the impacts of interventions more comprehensively.
For most control experiments where estimation accuracy is the primary objective, such as online A/B tests, the proposed MODE is the preferred method. Since the MODE is deterministic, randomization inference based on the MODE is limited [22]; however, asymptotic inference remains feasible. In situations where exact inference is the primary goal or randomization is necessary due to practical constraints, such as policy requirements, ethical considerations, or the presence of potential unobserved confounding variables, we recommend using re-randomization based on the MODE described in (15). This approach randomizes over a set of treatments that are near the minima of the maximum risk, thus enhancing covariate balance.
The MODE can be directly extended to multi-level experiments by sequentially minimizing generalized discrepancies across treatment groups. Thus, it offers a potential alternative to the OSAT tool [23] for sample-to-batch allocation in genomics experiments. The MODE proposed in this paper assumes that all covariates are collected in advance of conducting the experiment, including scenarios such as online A/B tests, classrooms with student data, or companies with employee records. Another potential future research direction involves adaptive allocation, particularly in biomedical experiments with human subjects, where treatments must be assigned either individually or in batches as participants are enrolled.