On Lower Bounds for Statistical Learning Theory

Loh, Po-Ling

doi:10.3390/e19110617



by Po-Ling Loh

Department of Electrical and Computer Engineering, University of Wisconsin-Madison, 1415 Engineering Drive, Madison, WI 53706, USA

Entropy 2017, 19(11), 617; https://doi.org/10.3390/e19110617

Submission received: 7 September 2017 / Revised: 27 October 2017 / Accepted: 14 November 2017 / Published: 15 November 2017

(This article belongs to the Special Issue Information Theory in Machine Learning and Data Science)


Abstract:

In recent years, tools from information theory have played an increasingly prevalent role in statistical machine learning. In addition to developing efficient, computationally feasible algorithms for analyzing complex datasets, it is of theoretical importance to determine whether such algorithms are “optimal” in the sense that no other algorithm can lead to smaller statistical error. This paper provides a survey of various techniques used to derive information-theoretic lower bounds for estimation and learning. We focus on the settings of parameter and function estimation, community recovery, and online learning for multi-armed bandits. A common theme is that lower bounds are established by relating the statistical learning problem to a channel decoding problem, for which lower bounds may be derived involving information-theoretic quantities such as the mutual information, total variation distance, and Kullback–Leibler divergence. We close by discussing the use of information-theoretic quantities to measure independence in machine learning applications ranging from causality to medical imaging, and mention techniques for estimating these quantities efficiently in a data-driven manner.

Keywords:

machine learning; minimax estimation; community recovery; online learning; multi-armed bandits; channel decoding; threshold phenomena

1. Introduction

Statistical learning theory refers to the rigorous mathematical analysis of machine learning algorithms [1,2]. On one hand, it is desirable to derive error bounds for the performance of particular machine learning algorithms under appropriate assumptions on the probabilistic models used to generate the data. On the other hand, it is important to understand the fundamental limitations of any algorithmic procedure, which may be influenced by quantities such as the sample size, signal-to-noise ratio, or smoothness of an ambient function space. Whereas statistical techniques based on concentration inequalities and empirical process theory may often be employed to derive rates of convergence of specific estimators to the underlying parameters of a data-generating distribution, the somewhat trickier problem of quantifying the best possible performance of any learning procedure requires tools from information theory.

A general approach is to relate the machine learning task at hand to an appropriate channel decoding problem, where the output corresponds to the observed data and the input corresponds to a cleverly constructed subset of the parameter space. For estimation problems, the key observation is that, if the underlying parameters may be estimated closely (i.e., on the level of discretization of the subset of parameter space), decoding may be performed accurately with high probability. The hardness of the decoding problem may in turn be quantified using techniques in information theory [3], leading to a lower bound on the estimation error. This strategy has been applied successfully to a diverse array of statistical estimation problems, including parametric and nonparametric regression, structure estimation for graphical models, covariance matrix estimation, and dimension reduction methods such as principal component analysis [4,5,6,7,8,9]. Section 2 discusses the method and several illustrative examples in greater detail.

Although some classes of machine learning problems may not be analyzed directly using these methods, alternative approaches involving related information-theoretic concepts may be employed. In Section 3 and Section 4, we consider the problems of community recovery and online learning, which are both active areas of research in machine learning. Our discussion of weak recovery in the community estimation setting is similar to the framework described in Section 2, but since the loss function used to quantify the estimation error incurred by the algorithm is more complicated, a more careful analysis must be conducted to derive sharp lower bounds. The theory characterizing the regimes in which exact recovery is possible is of a somewhat different flavor, but the emergence of sharp thresholds may again be related to Shannon coding theory. Section 4, concerning online learning for multi-armed bandits, provides a still different setting, where the goal is to bound a quantity known as regret. Although this is a radically different goal from bounding estimation error, the techniques used to obtain lower bounds for multi-armed bandits nonetheless include components of reductions to channel decoding problems: the key is to relate the performance of a learning algorithm to a problem of distinguishing between pairs of parameter assignments corresponding to underlying reward distributions that are close in parameter space.

We include proof sketches for the stated theorems in the main text of the paper, with references to resources where the reader can find more detailed proofs and additional background material. Although the discussion of each problem setting is necessarily brief, given the broad scope of this paper, we hope that our survey will convey the high-level ideas involved in applying information-theoretic tools to derive lower bounds for some statistical machine learning problems in a clear, concise manner. We have intentionally selected a diverse variety of problem settings in order to help the reader compare and contrast different approaches for obtaining lower bounds and identify the common threads underlying all the strategies.

2. Statistical Estimation

We begin by discussing an approach based on minimax theory for statistical estimation problems [10]. Our goal is a lower bound on the following quantity, known as the minimax risk:

$$\inf_{\hat{θ}} \sup_{P \in \mathcal{P}} E_{X \sim P} [ℓ (\hat{θ} (X), θ (P))], \tag{1}$$

where ℓ is a symmetric loss function. Here, $\mathcal{P}$ denotes a class of data-generating distributions and $θ : \mathcal{P} \to Ω$ is a functional that maps each distribution in $\mathcal{P}$ to a parameter in the metric space $Ω$. The expectation in expression (1) is taken with respect to data from a particular distribution $P \in \mathcal{P}$, and the infimum is then taken over all possible estimators $\hat{θ} = \hat{θ} (X)$ computed from the data. In other words, quantity (1) captures the worst-case risk of the best possible estimator. Whereas statistical analysis of a specific estimator can provide an upper bound on the minimax risk, tools from information theory may be used to derive a lower bound on the same quantity. Throughout this section, we will restrict our attention to the setting where $ℓ = Φ \circ ρ$, for a metric $ρ$ and monotonically increasing function $Φ : [0, \infty) \to [0, \infty)$. For instance, Example 2 below will discuss the setting where $ρ$ is the $L_{2}$-distance in a function space and $Φ (t) = t^{2}$, so ℓ is the squared $L_{2}$-distance.

The basic idea is to transform an estimation problem into a decoding problem, in which we wish to infer the correct message from a discrete set of messages, corresponding to a collection of parameters. The estimation problem must be at least as hard as the decoding problem, since, if the parameters in the discrete set are appropriately separated, accurate parameter estimation implies accurate decoding. In Section 2.1, we present a general technique based on Fano’s inequality, which expresses the probability of error for the decoding in terms of the mutual information between the input (parameters in the discrete subset) and output (observed data). Section 2.2 and Section 2.3 then provide methods for bounding the mutual information and discuss applications to concrete statistical estimation settings. We will follow the convention of Cover and Thomas [3] and take all logarithms with respect to base 2 in our definitions of entropy and mutual information; analogous results hold when logarithms are taken with respect to base e.

2.1. Fano’s Method

We begin by describing the general approach for deriving lower bounds. The key idea consists of relating the estimation problem to a decoding problem, and then using Fano’s inequality to lower-bound the probability of error for the decoding problem. Recall the definition of the mutual information:

I (Y; X) = H (Y) - H (Y | X) .

(2)

The main result relates the minimax risk to the mutual information between observations and the data-generating distribution.

Theorem 1.

Suppose $\{P_{1}, \dots, P_{M}\} \subseteq \mathcal{P}$ satisfy $ρ (θ (P_{i}), θ (P_{j})) \geq 2 δ$, for all $i \neq j$. Then

$$\inf_{\hat{θ}} \sup_{P \in \mathcal{P}} E_{X \sim P} [ℓ (\hat{θ} (X), θ (P))] \geq Φ (δ) \left(1 - \frac{I (Y; X) + 1}{\log_{2} M}\right),$$

where Y is distributed uniformly on $\{1, \dots, M\}$ and the conditional distribution of X given Y is defined by $X \mid \{Y = j\} \sim P_{j}$.

Proof (sketch).

We begin by writing

sup_{P \in P} E_{X \sim P} [ℓ (\hat{θ} (X), θ (P))] \geq \frac{1}{M} \sum_{i = 1}^{M} E_{X \sim P_{i}} [ℓ (\hat{θ} (X), θ (P_{i}))] .

(3)

If we define the decision rule

$$ψ (X) = \arg\min_{1 \leq j \leq M} ℓ (θ_{j}, \hat{θ} (X)),$$

where $θ_{j} = θ (P_{j})$ and we break ties arbitrarily, we may verify that

$$E_{X \sim P_{i}} [ℓ (\hat{θ} (X), θ (P_{i}))] \overset{(a)}{\geq} Φ (δ) P_{i} (ℓ (\hat{θ} (X), θ (P_{i})) \geq Φ (δ)) \overset{(b)}{\geq} Φ (δ) P_{i} (ψ (X) \neq i),$$

for each $1 \leq i \leq M$. Inequality (a) is a direct application of Markov’s inequality, and inequality (b) follows from the fact that if $ℓ (\hat{θ}, θ_{i}) < Φ (δ)$, or equivalently, $ρ (\hat{θ}, θ_{i}) < δ$, then

$$ρ (\hat{θ}, θ_{j}) \geq ρ (θ_{i}, θ_{j}) - ρ (θ_{i}, \hat{θ}) > 2 δ - δ > ρ (\hat{θ}, θ_{i}), \quad \forall j \neq i,$$

implying that $ψ (X) = i$.

Now, recall the statement of Fano’s inequality:

Lemma 1

(Fano’s inequality [3]) For any estimator $\hat{Y}$ of Y such that $Y \to X \to \hat{Y}$ forms a Markov chain, it holds that

$$P (\hat{Y} \neq Y) \geq \frac{H (Y | X) - 1}{\log_{2} | \mathcal{Y} |},$$

where $\mathcal{Y}$ is the range of Y.

Applying Lemma 1 with $\hat{Y} = ψ (X)$ and writing out the error probability explicitly, we obtain

$$\frac{1}{M} \sum_{i = 1}^{M} P_{i} (ψ (X) \neq i) \geq \frac{H (Y | X) - 1}{\log_{2} M} = \frac{\log_{2} M - I (Y; X) - 1}{\log_{2} M}, \tag{4}$$

where the equality follows from relation (2) and the fact that Y has a uniform distribution. Combining inequalities (3) and (4) establishes the desired result. ☐
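To make the bound in Theorem 1 concrete, the following Python sketch evaluates $Φ (δ) (1 - (I (Y; X) + 1) / \log_{2} M)$ for given inputs. The function name and the numerical values are illustrative assumptions rather than quantities from the paper; the result is clipped at zero because the risk is nonnegative.

```python
import math


def fano_minimax_lower_bound(phi_delta: float, mutual_info_bits: float,
                             num_hypotheses: int) -> float:
    """Evaluate Phi(delta) * (1 - (I(Y;X) + 1) / log2(M)), clipped at zero.

    mutual_info_bits is (an upper bound on) I(Y; X) measured in bits,
    matching the base-2 convention used in the text.
    """
    if num_hypotheses < 2:
        raise ValueError("need at least two hypotheses")
    factor = 1.0 - (mutual_info_bits + 1.0) / math.log2(num_hypotheses)
    return phi_delta * max(factor, 0.0)


# Hypothetical inputs: M = 2**20 hypotheses, I(Y;X) <= 9 bits, Phi(delta) = 0.1.
bound = fano_minimax_lower_bound(0.1, 9.0, 2**20)
```

The sketch shows why the packing must be rich relative to the information the data carry: the bound is nontrivial only when $\log_{2} M$ exceeds $I (Y; X) + 1$.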

In the following subsections, we describe two methods for upper-bounding the mutual information term $I (Y; X)$ appearing in Theorem 1, yielding a lower bound on the minimax risk.

2.2. Local Packings

The first method applies the convexity of the Kullback–Leibler (KL) divergence to obtain an upper bound on $I (Y; X)$ in terms of pairwise KL divergences. We have the following lemma:

Lemma 2.

Let X and Y be defined as in Theorem 1. Then,

$$I (Y; X) \leq \frac{1}{M^{2}} \sum_{1 \leq i, j \leq M} D_{KL} (P_{i} \| P_{j}) .$$

Proof.

We can check that

$$I (Y; X) = \frac{1}{M} \sum_{i = 1}^{M} D_{KL} (P_{i} \| \bar{P}),$$

where $\bar{P} = \frac{1}{M} \sum_{j = 1}^{M} P_{j}$ is a mixture distribution. By the convexity of the KL divergence in its second argument, we then have

$$I (Y; X) \leq \frac{1}{M} \sum_{i = 1}^{M} \frac{1}{M} \sum_{j = 1}^{M} D_{KL} (P_{i} \| P_{j}),$$

which is the desired expression. ☐
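The inequality in Lemma 2 can be checked numerically on a small discrete example. The sketch below is only an illustration: the three distributions are arbitrary choices made here, and the helper names are hypothetical.

```python
import math


def kl_bits(p, q):
    """KL divergence D(p || q) in bits for discrete distributions with q > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mutual_information_bits(dists):
    """I(Y; X) when Y is uniform over the rows and X | Y = i has law dists[i]."""
    m = len(dists)
    mixture = [sum(column) / m for column in zip(*dists)]
    return sum(kl_bits(p, mixture) for p in dists) / m


def average_pairwise_kl_bits(dists):
    """(1 / M^2) * sum over all ordered pairs (i, j) of D(P_i || P_j)."""
    m = len(dists)
    return sum(kl_bits(p, q) for p in dists for q in dists) / m**2


# Arbitrary example distributions on a three-symbol alphabet (an assumption).
dists = [[0.5, 0.3, 0.2], [0.2, 0.5, 0.3], [0.3, 0.2, 0.5]]
mi = mutual_information_bits(dists)
avg_kl = average_pairwise_kl_bits(dists)
```

In this example the mutual information is strictly smaller than the average pairwise divergence, which is the slack introduced by the convexity step.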

This bounding technique is known as a “local packing”, since the trick is to design an appropriate set $\{P_{1}, \dots, P_{M}\}$ such that the parameters $θ (P_{i})$ are $2 δ$-separated, while the pairwise KL divergences between the data-generating distributions are relatively small.

Example 1

(High-dimensional linear regression). Suppose we have observation pairs $\{(x_{i}, y_{i})\}_{i = 1}^{n}$ from a linear model:

$$y_{i} = x_{i}^{T} β^{*} + w_{i},$$

where $x_{i} \in \mathbb{R}^{p}$ and $w_{i} \sim N (0, σ^{2})$ is i.i.d. noise, and $β^{*} \in \mathbb{R}^{p}$ is the unknown parameter vector. We assume that $p > n$, but $β^{*}$ is known to have at most s nonzero values, where $s \leq n$. More precisely, if $B_{q} (r)$ denotes the ball of radius r in the $ℓ_{q}$ norm, we are interested in characterizing the minimax risk over the parameter space

$$B_{0} (s) \cap B_{2} (1) = \{β \in \mathbb{R}^{p} : \| β \|_{0} \leq s, \| β \|_{2} \leq 1\} .$$

For any fixed parameter $δ > 0$, it is possible to construct a subset of parameters $\{β_{1}, \dots, β_{M}\}$ lying in the parameter space such that $δ \leq \| β_{j} - β_{k} \|_{2} \leq 2 δ \sqrt{2}$ for all $1 \leq j < k \leq M$ and $\log M \geq \frac{s}{2} \log (\frac{p - s}{s / 2})$, essentially by rescaling a packing of the subset of $\{- 1, 0, 1\}^{p}$ of s-sparse vectors such that the Hamming distance between any two elements is at least $\frac{s}{2}$ [4,11]. Furthermore, we may compute the pairwise KL divergences in terms of the squared $ℓ_{2}$-norm between parameter vectors, so

$$D_{KL} (P_{j} \| P_{k}) = \frac{1}{2 σ^{2}} \| X (β_{j} - β_{k}) \|_{2}^{2} \leq \frac{4 n δ^{2} γ_{2 s}^{2}}{σ^{2}},$$

where $X \in \mathbb{R}^{n \times p}$ is the design matrix with rows $x_{i}^{T}$ and $γ_{2 s} = \sup_{β \in B_{0} (2 s)} \frac{\| X β \|_{2}}{\sqrt{n} \| β \|_{2}}$. Note that $P_{j}$ and $P_{k}$ refer to the conditional distributions of the $y_{i}$’s given the $x_{i}$’s for this example, so we are assuming the design matrix is fixed. Applying Theorem 1 and Lemma 2 with ρ equal to the $ℓ_{2}$-distance and Φ equal to the identity, we therefore have

$$\inf_{\hat{β}} \sup_{β \in B_{0} (s) \cap B_{2} (1)} E [\| \hat{β} - β \|_{2}] \geq \frac{δ}{2} \left(1 - \frac{\frac{4 n δ^{2} γ_{2 s}^{2}}{σ^{2}} + 1}{\frac{s}{2} \log (\frac{p - s}{s / 2})}\right) .$$

Taking $δ^{2} ≍ \frac{σ^{2} s \log (\frac{p - s}{s / 2})}{γ_{2 s}^{2} n}$ with a sufficiently small constant, and assuming that the problem dimensions satisfy $n \geq C s \log p$, we then obtain a lower bound of the form

$$\inf_{\hat{β}} \sup_{β \in B_{0} (s) \cap B_{2} (1)} E [\| \hat{β} - β \|_{2}] \geq \frac{δ}{4} \geq \frac{c σ}{γ_{2 s}} \sqrt{\frac{s}{n} \log \left(\frac{p - s}{s}\right)} .$$

In the case of the $ℓ_{2}$-loss, the Lasso estimator achieves the risk expression in the lower bound (up to constant factors), implying that it is a rate-optimal estimator [4]. Similar bounds on the minimax risk may be derived when the norms appearing in the loss function and/or parameter space are replaced by a general $ℓ_{q}$-norm [4,12].

2.3. Metric Entropy

The second method for bounding $I (Y; X)$, due to Yang and Barron [13], is based on the metric entropy of the parameter space. Recall the notion of the ϵ-covering number of a set in a metric space, which is the minimum number of ϵ-balls required to cover the set. The logarithm of the covering number is also known as the metric entropy. In particular, we are interested in the quantity $\log N_{KL} (ϵ; \mathcal{P})$, defined by

$$N_{KL} (ϵ; \mathcal{P}) = \min \left\{N : \exists \{Q_{1}, \dots, Q_{N}\} \subseteq \mathcal{P} \text{ s.t. } \min_{1 \leq i \leq N} \sqrt{D_{KL} (P \| Q_{i})} \leq ϵ, \ \forall P \in \mathcal{P}\right\},$$

which denotes the ϵ-covering number of $\mathcal{P}$, where distances are measured with respect to the square root KL divergence. We have the following bound:

Lemma 3.

Let X and Y be defined as in Theorem 1. Then,

I (Y; X) \leq inf_{ϵ > 0} \{ϵ^{2} + log N_{K L} (ϵ; P)\} .

Proof (sketch).

Suppose $\{Q_{1}, \dots, Q_{N}\}$ is an ϵ-cover of $\mathcal{P}$ with respect to the square root KL divergence, of minimal cardinality $N = N_{KL} (ϵ; \mathcal{P})$. Letting $\bar{P} = \frac{1}{M} \sum_{i = 1}^{M} P_{i}$ and $\bar{Q} = \frac{1}{N} \sum_{j = 1}^{N} Q_{j}$, we can check that

$$I (Y; X) = \frac{1}{M} \sum_{i = 1}^{M} D_{KL} (P_{i} \| \bar{P}) \leq \frac{1}{M} \sum_{i = 1}^{M} D_{KL} (P_{i} \| \bar{Q}),$$

where the inequality holds because $\bar{P}$ minimizes the average KL divergence with respect to the second argument. Furthermore, for each i, we know that there exists some $Q_{n}$ in the cover such that $D_{KL} (P_{i} \| Q_{n}) \leq ϵ^{2}$, implying that

$$\begin{aligned} D_{KL} (P_{i} \| \bar{Q}) & = \int \log \frac{d P_{i} (X)}{d \bar{Q} (X)} \, d P_{i} (X) \leq \int \log \frac{d P_{i} (X)}{\frac{1}{N} d Q_{n} (X)} \, d P_{i} (X) = D_{KL} (P_{i} \| Q_{n}) + \log N \\ & \leq ϵ^{2} + \log N_{KL} (ϵ; \mathcal{P}) . \end{aligned}$$

Since the above inequality holds for all $ϵ > 0$, we may take an infimum over ϵ to obtain the stated bound. ☐

As an example of the above technique, we consider the problem of nonparametric regression. Note that the following example shows that the general machinery developed above, though described in terms of parameter estimation, may be applied to nonparametric settings involving function estimation, as well.

Example 2.

(Nonparametric regression) Suppose we observe i.i.d. pairs $\{(x_{i}, y_{i})\}_{i = 1}^{n}$, where

$$y_{i} = f^{*} (x_{i}) + w_{i},$$

$x_{i} \sim \mathrm{Uniform} [0, 1]$, $w_{i} \sim N (0, 1)$, and $x_{i}$ is independent of $w_{i}$. We also assume that $f^{*}$ belongs to the function class $\mathcal{F}_{s}$, for a positive integer s, defined as the set of all continuous functions f on $[0, 1]$ satisfying the following properties:

(i): f is differentiable $s - 1$ times on $(0, 1)$,
(ii): $\sup_{0 \leq x \leq 1} | f^{(k)} (x) | \leq 1$, for all $k = 0, 1, \dots, s - 1$, where $f^{(0)} (x) := f (x)$,
(iii): $f^{(s - 1)}$ is 1-Lipschitz on $(0, 1)$.

We derive lower bounds on the minimax risk of estimating $f^{*}$ when ℓ is the squared $L_{2}$-distance, defined by

$$ℓ (f, g) = \int_{0}^{1} (f (x) - g (x))^{2} \, d x .$$

Hence, we will take $Φ (t) = t^{2}$ and ρ equal to the $L_{2}$-distance. Let $\mathcal{P}$ denote the set of joint distributions of $(x, y)$ generated by the class $\mathcal{F}_{s}$. By standard results on the metric entropy of function classes [14,15], we have the bound

$$c \left(\frac{1}{ϵ}\right)^{1 / s} \leq \log N_{2} (ϵ; \mathcal{F}_{s}) \leq C \left(\frac{1}{ϵ}\right)^{1 / s},$$

where $\log N_{2} (ϵ; \mathcal{F}_{s})$ denotes the metric entropy of $\mathcal{F}_{s}$ with respect to the $L_{2}$-distance. Furthermore, for any $δ > 0$, there exists a δ-packing $\{f_{1}, \dots, f_{M}\}$ of $\mathcal{F}_{s}$ in the $L_{2}$-metric such that $\log M = c' (\frac{1}{δ})^{1 / s}$. For two functions $f, g \in \mathcal{F}_{s}$, we may compute the KL divergence between the corresponding distributions $P_{f}, P_{g} \in \mathcal{P}$ of the n i.i.d. observation pairs:

$$D_{KL} (P_{f} \| P_{g}) = \frac{n}{2} \cdot \| f - g \|_{2}^{2} .$$

Hence, it follows that

$$\log N_{KL} (ϵ; \mathcal{P}) \leq \log N_{2} \left(ϵ \sqrt{\frac{2}{n}}; \mathcal{F}_{s}\right) \leq C \left(\frac{1}{ϵ} \sqrt{\frac{n}{2}}\right)^{1 / s} .$$

Minimizing the bound obtained from Lemma 3 with respect to ϵ, we obtain $ϵ^{*} = C' n^{\frac{1}{4 s + 2}}$, so that the mutual information is of order $(ϵ^{*})^{2} ≍ n^{1 / (2 s + 1)}$. Plugging back into Theorem 1 (with separation δ, so that the leading factor is $Φ (δ / 2) = δ^{2} / 4$), we obtain the lower bound

$$\frac{δ^{2}}{4} \left(1 - \frac{C'' n^{1 / (2 s + 1)}}{(1 / δ)^{1 / s}}\right) .$$

Taking $δ ≍ (\frac{1}{n})^{s / (2 s + 1)}$ then yields the bound

$$\inf_{\hat{f}} \sup_{f^{*} \in \mathcal{F}_{s}} E_{f^{*}} [\| \hat{f} - f^{*} \|_{2}^{2}] \geq c' \left(\frac{1}{n}\right)^{2 s / (2 s + 1)} .$$

A matching upper bound may be derived using local weighted polynomial regression [16], so the minimax risk is $Θ (n^{- 2 s / (2 s + 1)})$.
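The optimization over ϵ in this example can be checked with a few lines of Python. The sketch below minimizes $ϵ^{2} + C (\sqrt{n / 2} / ϵ)^{1 / s}$ in closed form and measures how the minimizer and the minimized value scale with n. The constant $C = 1$, the choice $s = 2$, and all function names are assumptions made for illustration.

```python
import math


def metric_entropy_objective(eps, n, s, C=1.0):
    """Upper bound on I(Y; X) from Lemma 3: eps^2 + C * (sqrt(n/2) / eps)^(1/s)."""
    return eps**2 + C * (math.sqrt(n / 2.0) / eps) ** (1.0 / s)


def optimal_eps(n, s, C=1.0):
    """Stationary point: eps^(2 + 1/s) = (C / (2 s)) * (n / 2)^(1 / (2 s))."""
    return ((C / (2.0 * s)) * (n / 2.0) ** (1.0 / (2.0 * s))) ** (s / (2.0 * s + 1.0))


def loglog_slope(f, n1, n2):
    """Empirical exponent alpha in f(n) ~ n^alpha, measured between n1 and n2."""
    return math.log(f(n2) / f(n1)) / math.log(n2 / n1)


s = 2  # smoothness level (illustrative)
eps_slope = loglog_slope(lambda n: optimal_eps(n, s), 1e3, 1e6)
value_slope = loglog_slope(
    lambda n: metric_entropy_objective(optimal_eps(n, s), n, s), 1e3, 1e6
)
```

Under these assumptions the minimizer grows like $n^{1 / (4 s + 2)}$, while the minimized bound on the mutual information grows like $n^{1 / (2 s + 1)}$, which is the quantity that must be matched by $\log M$.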

3. Community Recovery

Another area of machine learning that has recently received a substantial amount of attention concerns recovering communities based on node connectivity in a network. A popular probabilistic model is known as the stochastic block model (SBM). In the simplest form of the model, parametrized by $(n, K, p, q)$, the graph has nodes $\{1, \dots, n\}$ partitioned into K communities. Let the community label of node i be denoted by $σ (i)$. The edge set E of the random graph G is then constructed in the following manner: each edge $(i, j)$ is generated independently from all others, with probability

$$P ((i, j) \in E) = \begin{cases} p, & \text{if } σ (i) = σ (j), \\ q, & \text{if } σ (i) \neq σ (j) . \end{cases}$$

The goal is to partition the n nodes into the underlying communities based on observing the graph G.

In order to measure the performance of an algorithm, we consider the loss function

$$r (\hat{σ}, σ) = \frac{1}{n} \min_{τ \in S_{K}} d_{H} (\hat{σ}, τ \circ σ) .$$

Here, the estimator $\hat{σ} : \{1, \dots, n\} \to \{1, \dots, K\}$ corresponds to a partitioning of the nodes into K communities, and $d_{H}$ denotes the Hamming distance between assignments. Furthermore, we take the minimum over all permutations $S_{K}$ of the community labels. Hence, $r (\hat{σ}, σ)$ is the proportion of incorrectly labeled nodes (for the optimal labeling of partitions). We will focus our discussion on the setting where K is fixed, but p and q may vary with n; generalizations exist in the literature where K is allowed to grow with n, as well. We are interested in the behavior of various algorithms as $n \to \infty$.
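The loss $r (\hat{σ}, σ)$ can be computed directly from its definition by brute force over label permutations. The following Python sketch is a minimal illustration that assumes labels are coded as $0, \dots, K - 1$ (rather than $1, \dots, K$) and that K is small enough for all $K!$ permutations to be enumerated; the function name is hypothetical.

```python
from itertools import permutations


def misclassification_proportion(sigma_hat, sigma, K):
    """r(sigma_hat, sigma): Hamming distance minimized over relabelings, over n.

    Labels are assumed to take values in {0, ..., K - 1}.
    """
    n = len(sigma)
    best = n
    for tau in permutations(range(K)):
        # Count nodes whose estimated label differs from the relabeled truth.
        mismatches = sum(1 for est, true in zip(sigma_hat, sigma) if est != tau[true])
        best = min(best, mismatches)
    return best / n


sigma = [0, 0, 0, 1, 1, 1]       # true labels
sigma_hat = [1, 1, 1, 0, 0, 1]   # estimate: labels swapped, one node wrong
r = misclassification_proportion(sigma_hat, sigma, K=2)
```

Here the estimate agrees with the truth on five of six nodes once the two labels are swapped, so the loss is 1/6 rather than 5/6.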

In the following two subsections, we discuss the popular notions of weak recovery and exact recovery. The algorithm $\hat{σ}$ achieves weak recovery if $E [r (\hat{σ}, σ)] \to 0$ (i.e., the expected fraction of misclassified nodes tends to 0 as $n \to \infty$), and achieves exact recovery if $P (r (\hat{σ}, σ) = 0) \to 1$ (i.e., every node is labeled correctly, up to a permutation of the community labels, with probability tending to 1). For a more complete description of current work on stochastic block models, see the extensive survey paper by Abbe [17].

3.1. Weak Recovery

Analogous to the setting discussed in Section 2, we may derive bounds on the minimax risk

$$\inf_{\hat{σ}} \sup_{σ \in Σ (n, K)} E [r (\hat{σ}, σ)],$$

where $Σ (n, K)$ is an appropriate class of underlying community labelings. We state and prove a result for approximately equal-sized communities in the limit as $n \to \infty$, so $Σ (n, K)$ is the set of all labelings $σ$ such that $| \{i : σ (i) = k\} | = (1 + o (1)) \frac{n}{K}$, for all $1 \leq k \leq K$.

The main result is the following [18]:

Theorem 2.

Suppose $p = \frac{a}{n}$ and $q = \frac{b}{n}$, and suppose $\frac{n I}{K} \to \infty$, where

$$I = - 2 \log \left(\sqrt{\frac{a}{n}} \sqrt{\frac{b}{n}} + \sqrt{1 - \frac{a}{n}} \sqrt{1 - \frac{b}{n}}\right) . \tag{5}$$

A lower bound on the minimax risk of community estimation is given by

$$\inf_{\hat{σ}} \sup_{σ \in Σ (n, K)} E [r (\hat{σ}, σ)] \geq \exp \left(- (1 + o (1)) \frac{n I}{K}\right) .$$

Proof (sketch).

The core of the approach bears similarity to the method for obtaining lower bounds for estimation, in the sense that we construct a subset $Σ^{L}$ of the parameter space corresponding to “messages”, which we wish to recover via an appropriate decoding strategy. In the case when $K = 2$ (and n is even), the subset $Σ^{L}$ consists of all partitions of the nodes into equal-sized communities and communities of size $(\frac{n}{2} + 1, \frac{n}{2} - 1)$. We focus on the case $K = 2$ in the present proof sketch to avoid technical complications.

The proof is somewhat more involved than the strategies outlined in Section 2, however, since the unknown quantity to be estimated is a set of discrete labelings and the loss function is defined with respect to an optimal permutation. The first step is to lower-bound the minimax risk by the average risk over the class $Σ^{L}$. Furthermore, a more technical argument shows that we may just examine the average local risk defined with respect to a single node in the graph:

$$\begin{aligned} \inf_{\hat{σ}} \sup_{σ \in Σ (n, K)} E [r (\hat{σ}, σ)] & \geq \inf_{\hat{σ}} \sup_{σ \in Σ^{L}} E_{σ} [r (\hat{σ}, σ)] \\ & \geq \inf_{\hat{σ}} \frac{1}{| Σ^{L} |} \sum_{σ \in Σ^{L}} E [r (\hat{σ}, σ)] \\ & = \inf_{\hat{σ}} \frac{1}{| Σ^{L} |} \sum_{σ \in Σ^{L}} E [r_{1} (\hat{σ}, σ)], \end{aligned}$$

where $r_{1}$ is the local loss function defined with respect to node 1, which is the fraction of optimal permutations of community assignments that incorrectly classify node 1. The next step is to lower-bound the local risk (uniformly over all choices of $σ \in Σ^{L}$) using the minimum risk of a binary hypothesis testing problem, where the two hypotheses correspond to the possible assignments of node 1 as a member of the first or second community. In particular, we have the following inequality, which holds for each $σ$:

$$E [r_{1} (\hat{σ}, σ)] \geq c \, P \left(\sum_{i = 1}^{n / 2} X_{i} \geq \sum_{j = 1}^{n / 2} Y_{j}\right),$$

where $X_{i} \overset{i.i.d.}{\sim} \mathrm{Bernoulli} (\frac{b}{n})$ and $Y_{j} \overset{i.i.d.}{\sim} \mathrm{Bernoulli} (\frac{a}{n})$ are independent random variables. Standard techniques involving large deviation inequalities allow us to lower-bound the latter probability, thus yielding the overall lower bound appearing in the theorem. ☐

As demonstrated by Zhang and Zhou [18], the lower bound on the risk appearing in Theorem 2 may be achieved using a form of penalized likelihood estimation. A computationally feasible procedure was subsequently provided in Gao et al. [19].

Remark 1.

The quantity I appearing in Equation (5) is the Rényi divergence of order $\frac{1}{2}$ between a Bernoulli$(\frac{a}{n})$ and Bernoulli$(\frac{b}{n})$ distribution. In fact, these results generalize to the case of non-binary edge weights, and the Rényi divergence of order $\frac{1}{2}$ also appears in the minimax rates for estimation in weighted stochastic block models [20]. Furthermore, if the communities are not all of equal size, alternative divergence functions appear in the error exponent [21,22]. Finally, note that the regime where $p = \frac{a}{n}$ and $q = \frac{b}{n}$, with $a, b = Θ (1)$, corresponds to the threshold at which giant components emerge in the network [23]. Theorem 2 allows a and b to scale arbitrarily with n, provided $\frac{n I}{K} \to \infty$, which will not hold if $a, b = Θ (1)$, since $n I$ then remains bounded.
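The behavior of the exponent $n I$ is easy to explore numerically. The Python sketch below evaluates the quantity in Equation (5) and compares $n I$ with $(\sqrt{a} - \sqrt{b})^{2}$, which is its first-order approximation when $a / n$ and $b / n$ are small. The natural logarithm, the values $a = 9$ and $b = 4$, and the function names are assumptions made for illustration.

```python
import math


def renyi_half_bernoulli(p, q):
    """Renyi divergence of order 1/2 between Bernoulli(p) and Bernoulli(q), in nats."""
    return -2.0 * math.log(math.sqrt(p * q) + math.sqrt((1.0 - p) * (1.0 - q)))


def n_times_I(a, b, n):
    """n * I for within-community probability a/n and cross-community probability b/n."""
    return n * renyi_half_bernoulli(a / n, b / n)


a, b = 9.0, 4.0                                  # illustrative constants
first_order = (math.sqrt(a) - math.sqrt(b)) ** 2
value = n_times_I(a, b, 10**6)
```

For fixed a and b, the computed $n I$ stays close to $(\sqrt{a} - \sqrt{b})^{2}$ as n grows, so the exponent in Theorem 2 diverges only when a and b themselves grow with n.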

3.2. Exact Recovery

Information-theoretic arguments may also be used to establish lower bounds for exact recovery in stochastic block models, which corresponds to correct classification of every single node (up to permutation of the community labels). We present a result, due to Abbe et al. [24], that provides lower bounds for exact recovery in the case of two equal-sized communities.

We have the following result:

Theorem 3.

Let $p = \frac{a \log n}{n}$ and $q = \frac{b \log n}{n}$, where $a > b \geq 0$. If $(\sqrt{a} - \sqrt{b})^{2} < 2$, then for sufficiently large n, the maximum likelihood estimator fails in recovering the communities with probability bounded away from 0:

$$\liminf_{n \to \infty} P (r (\hat{σ}_{MLE}, σ) \neq 0) > 0 .$$

Proof (sketch).

We denote the two communities by A and B. Let F be the event that the maximum likelihood estimator fails in performing exact recovery, and let

$$\begin{aligned} F_{A} & = \{\exists i \in A : i \text{ is connected to more nodes in } B \text{ than in } A\}, \\ F_{B} & = \{\exists j \in B : j \text{ is connected to more nodes in } A \text{ than in } B\} . \end{aligned}$$

By symmetry, we have $P (F_{A}) = P (F_{B})$. Furthermore, note that

$$F_{A} \cap F_{B} \subseteq F,$$

since if both $F_{A}$ and $F_{B}$ were to occur simultaneously, swapping the labels of the nodes i and j would lead to a higher value of the likelihood than in the case of correct labeling. In particular, this implies that

$$P (F) \geq P (F_{A} \cap F_{B}) \geq P (F_{A}) + P (F_{B}) - 1 = 2 P (F_{A}) - 1 . \tag{6}$$

Let $H \subseteq A$ denote a fixed subset with $| H | = \lfloor \frac{n}{\log^{3} (n)} \rfloor$, and define the event

$$F_{H} = \left\{\exists j \in H \text{ s.t. } E (j, A \setminus H) + \frac{\log n}{\log \log n} \leq E (j, B)\right\},$$

where $E (j, C)$ denotes the number of edges between j and the nodes in C. Note that, if event $F_{H}$ occurs and all nodes in H are connected to at most $\frac{\log n}{\log \log n}$ other nodes in H, then event $F_{A}$ must occur. Furthermore, one can show that, with high probability, every node in H is connected to at most $\frac{\log n}{\log \log n}$ other nodes in H. Hence,

$$P (F_{A}) \geq P (F_{H}) + o (1) . \tag{7}$$

It remains to derive a lower bound on $P (F_{H})$. For $j \in H$, let

$$F_{H}^{(j)} = \left\{E (j, A \setminus H) + \frac{\log n}{\log \log n} \leq E (j, B)\right\},$$

and note that the $F_{H}^{(j)}$’s are independent. Hence,

$$P (F_{H}) = P \left(\bigcup_{j \in H} F_{H}^{(j)}\right) = 1 - \prod_{j \in H} (1 - P (F_{H}^{(j)})) .$$

Straightforward techniques for bounding sums of independent Bernoulli random variables show that $P (F_{H}^{(j)}) > \frac{\log (4) \log^{3} (n)}{n}$ for each j, from which we can conclude that

$$P (F_{H}) \geq 1 - \left(1 - \frac{\log (4) \log^{3} (n)}{n}\right)^{\lfloor \frac{n}{\log^{3} (n)} \rfloor} = 1 - \frac{1}{4} + o (1) . \tag{8}$$

Combining inequalities (6)–(8) then yields the desired result. ☐
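The limit used in inequality (8) can be checked numerically. The short Python sketch below evaluates the product term for a few values of n, assuming natural logarithms; the function name and the chosen values of n are illustrative assumptions.

```python
import math


def failure_product(n):
    """(1 - log(4) * log(n)^3 / n) ** floor(n / log(n)^3), using natural logs."""
    log_cubed = math.log(n) ** 3
    size_h = math.floor(n / log_cubed)  # cardinality of the subset H
    return (1.0 - math.log(4.0) * log_cubed / n) ** size_h


values = {n: failure_product(n) for n in (10**6, 10**9, 10**12)}
```

The computed values approach $\frac{1}{4}$, so $P (F_{H})$ is asymptotically at least $\frac{3}{4}$, and inequalities (6) and (7) then give a failure probability of at least $2 \cdot \frac{3}{4} - 1 = \frac{1}{2}$ in the limit.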

Note that for any other estimator $\hat{σ}$, we have

$$P (r (\hat{σ}_{MLE}, σ) = 0) \geq P (r (\hat{σ}, σ) = 0) .$$

Hence, Theorem 3 also implies that

$$\liminf_{n \to \infty} P (r (\hat{σ}, σ) \neq 0) \geq \liminf_{n \to \infty} P (r (\hat{σ}_{MLE}, σ) \neq 0) > 0 .$$

In fact, a converse of Theorem 3 holds, as well:

Theorem 4.

Under the same conditions as in Theorem 3, suppose instead that $(\sqrt{a} - \sqrt{b})^{2} > 2$. Then, the maximum likelihood estimator succeeds in recovering the communities with probability tending to 1:

$$\lim_{n \to \infty} P (r (\hat{σ}_{MLE}, σ) = 0) = 1 .$$

Since the focus of this paper is to establish lower bounds, we refer the reader to Abbe et al. [24] for the proof of Theorem 4, which proceeds by direct calculation. An extension of Theorems 3 and 4 for weighted stochastic block models may be found in Jog and Loh [25].

Remark 2.

The threshold behavior described in Theorems 3 and 4 is perhaps not surprising in light of known threshold behavior in Shannon coding theory, and the connections between each of the statistical learning tasks and the problem of decoding on a discrete alphabet after passage through a noisy channel. Indeed, the community recovery problem has been cast in information-theoretic terminology as decoding in a “graphical channel” [26]. On the other hand, the coding scheme is fixed according to the stochastic block model, whereas Shannon theory allows one to design an optimal encoding scheme to achieve channel capacity. See also the paper by Chen et al. [27], and the derivation of similar types of sharp threshold behavior in submatrix localization problems [28,29]. Finally, we note that the scaling $p = \frac{a \log n}{n}$ and $q = \frac{b \log n}{n}$, when $a, b = Θ (1)$, corresponds to the threshold for the graph to have isolated vertices with probability tending to 1 [23]. Indeed, it would be impossible to perform exact recovery with high probability in the presence of isolated vertices: flipping the community assignments of two isolated vertices belonging to the two different communities would not change the value of the likelihood.

4. Online Learning

We now shift our focus to sequential allocation problems. The setup we consider involves a series of actions taken by a player, using limited feedback about the environment based on his/her past actions. We study the setting of a multi-armed bandit, where each potential action of the player is associated with a reward distribution, but the player only observes the reward corresponding to his/her action on successive rounds. In the following two subsections, we will consider the cases of stochastic and adversarial bandits and obtain bounds on a quantity known as regret. More details on the setting and results may be found in Bubeck and Cesa-Bianchi [30] or Cesa-Bianchi and Lugosi [31].

4.1. Stochastic Bandits

We first analyze the setting of stochastic multi-armed bandits. On each round, the player may choose one of k different arms. Associated to arm j is a reward distribution $P_{θ_{j}}$, where $θ_{j} \in Θ$ belongs to some parameter space. Furthermore, we assume that the reward distributions $\{P_{θ_{j}}\}_{j = 1}^{k}$ remain fixed across all rounds. We use the notation $μ (θ)$ to denote the mean of the distribution $P_{θ}$, and let $μ^{*} = \max_{1 \leq j \leq k} μ (θ_{j})$ denote the maximum expected reward.

Denote the sequence of actions chosen by the player as $(I_1, \dots, I_n)$, where $I_t \in \{1, \dots, k\}$ is the arm played at time $t$, and let $X_{I_t, t} \sim P_{\theta_{I_t}}$ denote the observed reward, which is an independent draw from the distribution $P_{\theta_{I_t}}$. Note that $I_t$ may be a function of the previously observed reward sequence $(X_{I_1, 1}, \dots, X_{I_{t-1}, t-1})$ and may also involve additional randomization. We are interested in bounding a quantity known as the pseudo-regret, defined as

$$\bar{R}_n = n \mu^* - \mathbb{E}\left[\sum_{t=1}^{n} X_{I_t, t}\right],$$

where we may also write $\bar{R}_n(\theta_1, \dots, \theta_k)$ to make the dependence on the reward distributions explicit. If the player employs a randomized strategy, the expectation is computed with respect to randomness in the sequence of actions $(I_1, \dots, I_n)$, as well as randomness generated by draws from the reward distributions. In other words, the pseudo-regret measures the difference between the expected reward obtained by playing the arm with maximum expected reward on every round and the expected reward obtained by the player’s strategy.
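To make the definition concrete, the following minimal Python sketch evaluates the pseudo-regret of a non-adaptive action sequence, for which the expectation reduces to a sum of arm means. The function name `pseudo_regret`, the zero-based arm indexing, and the example means are illustrative assumptions, not part of the original presentation.

```python
from typing import Sequence


def pseudo_regret(means: Sequence[float], actions: Sequence[int]) -> float:
    """Pseudo-regret n * mu_star - sum_t mu_{I_t} of a fixed action sequence.

    For a deterministic sequence, E[X_{I_t, t}] is simply the mean of arm I_t,
    so no simulation is needed. Arms are indexed from 0 here.
    """
    mu_star = max(means)
    n = len(actions)
    return n * mu_star - sum(means[a] for a in actions)


# Hypothetical example: three arms, round-robin play for n = 300 rounds.
means = [0.5, 0.6, 0.9]
round_robin = [t % len(means) for t in range(300)]
print(pseudo_regret(means, round_robin))
```

Round-robin play ignores the observed rewards, so its pseudo-regret grows linearly in the number of rounds, whereas always playing the best arm gives zero pseudo-regret.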

Lai and Robbins [32] prove the following result. We omit some technical regularity conditions on the parameter space, such as denseness of the parameter space and continuity with respect to the KL divergence, in order to avoid cluttering the presentation.

Theorem 5.

Suppose that, for all pairs $\theta_1, \theta_2 \in \Theta$ such that $\mu(\theta_1) > \mu(\theta_2)$, we have $0 < D_{\mathrm{KL}}(P_{\theta_2} \,\|\, P_{\theta_1}) < \infty$. Suppose a strategy satisfies $\bar{R}_n(\theta_1, \dots, \theta_k) = o(n^{\alpha})$, for all $\theta_1, \dots, \theta_k \in \Theta$ and all $\alpha > 0$. Then, for any $(\theta_1, \dots, \theta_k) \in \Theta^k$, we have

$$\liminf_{n \to \infty} \frac{\bar{R}_n(\theta_1, \dots, \theta_k)}{\log n} \ge \sum_{j : \mu_j < \mu^*} \frac{\mu^* - \mu_j}{D_{\mathrm{KL}}(P_{\theta_j} \,\|\, P_{\theta^*})},$$

where $\theta^* \in \arg\max_{\theta \in \{\theta_1, \dots, \theta_k\}} \mu(\theta)$ is the parameter of an optimal arm and $\mu_j = \mu(\theta_j)$.

Proof (sketch).

We may write

$$\bar{R}_n = \sum_{j=1}^{k} \mathbb{E}[T_j(n)] \, \Delta_j,$$

where $\Delta_j = \mu^* - \mu_j$ and $T_j(n) = \sum_{t=1}^{n} \mathbf{1}\{I_t = j\}$ is the number of pulls of arm $j$ up to time $n$. The main step is to show that the inequality

$$\mathbb{E}[T_j(n)] \ge \frac{\log n}{D_{\mathrm{KL}}(P_{\theta_j} \,\|\, P_{\theta^*})}, \quad \forall j : \mu_j < \mu^* \tag{9}$$

holds to leading order as $n \to \infty$ for any strategy satisfying the assumption of the theorem. Inequality (9) provides a lower bound on the expected number of pulls to any suboptimal arm (note that, as $P_{\theta_j}$ becomes further from $P_{\theta^*}$, the two arms are easier to distinguish, so the expected number of pulls to the suboptimal arm can be smaller). We focus on proving inequality (9) for $j = 2$; the other cases are similar.
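The decomposition of the pseudo-regret used at the start of the proof can be checked in a few lines. The following derivation is a sketch that assumes only that, conditional on the past, the reward of the arm played at time $t$ has mean $\mu_{I_t}$:

```latex
\begin{aligned}
\bar{R}_n
  &= n\mu^* - \mathbb{E}\Big[\sum_{t=1}^{n} X_{I_t,t}\Big]
   = n\mu^* - \mathbb{E}\Big[\sum_{t=1}^{n} \mu_{I_t}\Big] \\
  &= \mathbb{E}\Big[\sum_{t=1}^{n} \sum_{j=1}^{k} \mathbf{1}\{I_t = j\}\,(\mu^* - \mu_j)\Big]
   = \sum_{j=1}^{k} \mathbb{E}[T_j(n)]\,\Delta_j .
\end{aligned}
```

Since the optimal arm contributes $\Delta_j = 0$, lower-bounding the pseudo-regret reduces to lower-bounding the expected number of pulls of each suboptimal arm.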

Consider two parameter vectors $\theta = (\theta_1, \theta_2, \dots, \theta_k)$ and $\theta' = (\theta_1, \theta_2', \dots, \theta_k)$, which differ only in the second coordinate. We further choose the parameters such that

$$\begin{aligned} \mu_1 &> \mu_2 \ge \mu_3 \ge \dots \ge \mu_k, \\ \mu_2' &> \mu_1 > \mu_3 \ge \dots \ge \mu_k, \end{aligned}$$

so the second arm is suboptimal in the first setting but uniquely optimal in the second. We will choose $\theta_2'$ close to $\theta_1$, so

$$D_{\mathrm{KL}}(P_{\theta_2} \,\|\, P_{\theta_2'}) \approx D_{\mathrm{KL}}(P_{\theta_2} \,\|\, P_{\theta_1}) = D_{\mathrm{KL}}(P_{\theta_2} \,\|\, P_{\theta^*}).$$

(The regularity conditions on the parameter space and reward distributions ensure that such a choice is possible.) The idea is that, since $P_{\theta}$ and $P_{\theta'}$ are close, any strategy should pick roughly the same sequence of arms in both scenarios, but a strategy that performs well on $\theta$ will behave relatively poorly on $\theta'$ (and vice versa), since the ordering of arms according to optimality is different in the two settings. In particular, we will derive the following bound, relating the probabilities of pulling the second arm fewer than $a_n$ times in each of the parameter settings:

$$P_{\theta}(T_2(n) < a_n) \le c_n' \, P_{\theta'}(T_2(n) < a_n) + b_n, \tag{10}$$

where $a_n = \frac{(1 - 3\alpha) \log n}{D_{\mathrm{KL}}(P_{\theta_2} \,\|\, P_{\theta_2'})}$, and we take $\alpha < \frac{1}{3}$. We can show that $b_n = o(1)$ since $P_{\theta}$ and $P_{\theta'}$ are close, and that the term $c_n' \, P_{\theta'}(T_2(n) < a_n)$ is also $o(1)$, since arm 2 is optimal under $\theta'$.

For a fixed strategy, let $\{X_{j,s}\}_{1 \le j \le k,\; 1 \le s \le n}$ denote the rewards corresponding to various arm pulls. For $A \subseteq \{T_2(n) = n_2\}$, we have

$$\begin{aligned} P_{\theta'}(A) &= \int_A dP_{\theta'}(x) = \int_A \frac{dP_{\theta'}(x)}{dP_{\theta}(x)} \cdot dP_{\theta}(x) \\ &= \int_A \prod_{s=1}^{n_2} \frac{dP_{\theta_2'}(x_{2,s})}{dP_{\theta_2}(x_{2,s})} \, dP_{\theta}(x) \\ &= \int_A e^{-L_2(x)} \, dP_{\theta}(x), \end{aligned}$$

where we define $L_2(x) = \sum_{s=1}^{T_2(n)} \log \frac{dP_{\theta_2}(x_{2,s})}{dP_{\theta_2'}(x_{2,s})}$. In particular, if $A \subseteq \{T_2(n) = n_2, \; L_2(X) \le c_n\}$, we have

$$P_{\theta'}(A) \ge e^{-c_n} P_{\theta}(A),$$

where we will take $c_n = (1 - 2\alpha) \log n$. We may therefore write

$$\begin{aligned} P_{\theta}(T_2(n) < a_n) &= P_{\theta}(T_2(n) < a_n, \; L_2(X) \le c_n) + P_{\theta}(T_2(n) < a_n, \; L_2(X) > c_n) \\ &\le e^{c_n} P_{\theta'}(T_2(n) < a_n, \; L_2(X) \le c_n) + P_{\theta}(T_2(n) < a_n, \; L_2(X) > c_n) \\ &\le e^{c_n} P_{\theta'}(T_2(n) < a_n) + P_{\theta}(T_2(n) < a_n, \; L_2(X) > c_n), \end{aligned}$$

which is inequality (10) with $c_n' = e^{c_n}$ and $b_n = P_{\theta}(T_2(n) < a_n, \; L_2(X) > c_n)$. Note that if $T_2(n) < a_n$, then $L_2(X)$ is a partial sum of fewer than $a_n$ log-likelihood ratios, so

$$b_n \le P_{\theta}\left(\max_{m \le a_n} \sum_{s=1}^{m} \log \frac{dP_{\theta_2}(X_{2,s})}{dP_{\theta_2'}(X_{2,s})} > c_n\right) = o(1),$$

where the last equality follows from the fact that the rewards $\{X_{2,s}\}_{s=1}^{a_n}$ are i.i.d., so that a maximal version of the strong law of large numbers applies:

$$\frac{1}{a_n} \sum_{s=1}^{a_n} \log \frac{dP_{\theta_2}(X_{2,s})}{dP_{\theta_2'}(X_{2,s})} \xrightarrow{\text{a.s.}} \mathbb{E}_{\theta}\left[\log \frac{dP_{\theta_2}(X_{2,s})}{dP_{\theta_2'}(X_{2,s})}\right] = D_{\mathrm{KL}}(P_{\theta_2} \,\|\, P_{\theta_2'}),$$

whereas $\frac{c_n}{a_n} = \frac{1 - 2\alpha}{1 - 3\alpha} \, D_{\mathrm{KL}}(P_{\theta_2} \,\|\, P_{\theta_2'})$ is strictly larger than this limit.

Finally, we bound $P_{\theta'}(T_2(n) < a_n)$ using Markov’s inequality:

$$P_{\theta'}(T_2(n) < a_n) = P_{\theta'}(n - T_2(n) > n - a_n) \le \frac{\mathbb{E}_{\theta'}[n - T_2(n)]}{n - a_n} = o(n^{\alpha - 1}),$$

where the last equality follows from the fact that $a_n = o(n)$ and the assumption on $\bar{R}_n(\theta')$. Altogether, we conclude that the right-hand side of inequality (10) is $o(1)$.
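To see why the first term on the right-hand side of inequality (10) vanishes, it may help to spell out the arithmetic with $c_n' = e^{c_n}$ and $c_n = (1 - 2\alpha) \log n$, as chosen above:

```latex
c_n' \, P_{\theta'}\big(T_2(n) < a_n\big)
  = n^{1 - 2\alpha} \cdot o\big(n^{\alpha - 1}\big)
  = o\big(n^{-\alpha}\big) = o(1).
```

The exponent of $c_n$ is thus small enough for the Markov bound to absorb the factor $e^{c_n}$, while the ratio $c_n / a_n$ stays strictly above the KL divergence, which is what drives $b_n$ to zero.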

By another application of Markov’s inequality, together with inequality (10), we conclude that

$$\mathbb{E}_{\theta}[T_2(n)] \cdot \frac{D_{\mathrm{KL}}(P_{\theta_2} \,\|\, P_{\theta_2'})}{(1 - 3\alpha) \log n} = \frac{\mathbb{E}_{\theta}[T_2(n)]}{a_n} \ge P_{\theta}(T_2(n) \ge a_n) = 1 - P_{\theta}(T_2(n) < a_n) \to 1.$$

Hence, since $\alpha > 0$ may be taken arbitrarily small,

$$\liminf_{n \to \infty} \frac{\mathbb{E}_{\theta}[T_2(n)]}{\log n} \ge \frac{1}{D_{\mathrm{KL}}(P_{\theta_2} \,\|\, P_{\theta_2'})} \approx \frac{1}{D_{\mathrm{KL}}(P_{\theta_2} \,\|\, P_{\theta^*})},$$

as wanted. ☐

Note that the assumption $\bar{R}_n(\theta_1, \dots, \theta_k) = o(n^{\alpha})$ presupposes that a sufficiently good player strategy exists for all choices of reward parameters. In particular, such a condition may be verified when the reward distributions are Bernoulli (i.e., $P_{\theta} = \mathrm{Bernoulli}(\theta)$). Then, we have

$$D_{\mathrm{KL}}(P_{\theta_1} \,\|\, P_{\theta_2}) = \theta_1 \log \frac{\theta_1}{\theta_2} + (1 - \theta_1) \log \frac{1 - \theta_1}{1 - \theta_2},$$

and combined with Theorem 5 and the bound $D_{\mathrm{KL}}(P_{\theta_j} \,\|\, P_{\theta^*}) \le \frac{(\mu^* - \mu_j)^2}{\mu^* (1 - \mu^*)}$, we obtain the lower bound

$$\liminf_{n \to \infty} \frac{\bar{R}_n(\theta_1, \dots, \theta_k)}{\log n} \ge \mu^* (1 - \mu^*) \sum_{j : \mu_j < \mu^*} \frac{1}{\mu^* - \mu_j}.$$

A player strategy known as the Upper Confidence Bound (UCB) strategy may be shown to achieve this lower bound, up to constant factors [32,33].
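As a numerical illustration of the Bernoulli case, the sketch below evaluates both the constant from Theorem 5 and the relaxed constant $\mu^*(1 - \mu^*) \sum_j 1/(\mu^* - \mu_j)$ for one set of arm means. The function names and the chosen means are hypothetical, and the code assumes all means lie strictly inside $(0, 1)$ so that the logarithms are finite.

```python
import math
from typing import Sequence


def kl_bernoulli(p: float, q: float) -> float:
    """KL divergence D_KL(Bernoulli(p) || Bernoulli(q)) for p, q in (0, 1)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def lai_robbins_constant(means: Sequence[float]) -> float:
    """Sum over suboptimal arms of (mu* - mu_j) / KL(mu_j || mu*)."""
    mu_star = max(means)
    return sum((mu_star - m) / kl_bernoulli(m, mu_star) for m in means if m < mu_star)


def relaxed_constant(means: Sequence[float]) -> float:
    """Weaker constant mu* (1 - mu*) * sum_j 1 / (mu* - mu_j)."""
    mu_star = max(means)
    return mu_star * (1 - mu_star) * sum(1 / (mu_star - m) for m in means if m < mu_star)


means = [0.5, 0.6, 0.9]  # hypothetical arm means, all strictly inside (0, 1)
sharp = lai_robbins_constant(means)
relaxed = relaxed_constant(means)
print(sharp, relaxed)
```

The first constant is never smaller than the second, which reflects the fact that the relaxed bound follows from Theorem 5 by upper-bounding each KL divergence.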

Finally, we mention a non-asymptotic lower bound on the pseudo-regret that comes from the probably approximately correct (PAC) literature on bandits [34,35,36]:

Theorem 6.

In the case of Bernoulli reward distributions, there exist positive constants $\{c_i\}_{i=1}^{5}$ such that for all $k \ge 2$ and $n \ge 1$, the pseudo-regret of any strategy satisfies

$$\sup_{\theta_1, \dots, \theta_k \in [0, 1]} \bar{R}_n(\theta_1, \dots, \theta_k) \ge \min\left\{c_1 n, \; c_2 k + c_3 n, \; c_4 k (\log n - \log k + c_5)\right\}. \tag{11}$$

Proof (sketch).

For a detailed proof of Theorem 6, we refer the reader to Mannor and Tsitsiklis [36]. The main idea is to construct a collection of $k$ vectors $\{\theta^1, \dots, \theta^k\} \subseteq [0, 1]^k$ corresponding to the parameters of the reward distributions on arms. For each $2 \le i \le k$, we define the vector $\theta^i = (\theta_1^i, \dots, \theta_k^i)$ such that

$$\theta_1^i = \frac{1}{2} + \frac{\epsilon}{2}, \quad \theta_i^i = \frac{1}{2} + \epsilon, \quad \theta_j^i = \frac{1}{2} \text{ for } j \notin \{1, i\},$$

and we define the vector $\theta^1$ such that

$$\theta_1^1 = \frac{1}{2} + \frac{\epsilon}{2}, \quad \theta_j^1 = \frac{1}{2} \text{ for } j > 1.$$

In other words, the reward distribution of arm 1 is the same for all $k$ parameter settings, but in the case of vector $\theta^i$ with $i \ge 2$, the reward distribution for arm $i$ is slightly better than the reward distributions of the other arms. We then compute a weighted sum of the regret incurred in each parameter setting, where $\theta^1$ is given weight $\frac{1}{2}$ and all other $\theta^i$’s are given weight $\frac{1}{2(k - 1)}$, so that the weights sum to one. We may show that this weighted regret is lower-bounded by the quantity appearing in inequality (11), implying the existence of at least one parameter setting that satisfies the desired bound. Computing the lower bound for the weighted regret is similar to the procedure adopted in the proof of Theorem 5, in that we compute a lower bound on the expected number of arm pulls of each suboptimal arm in each parameter setting in terms of $\epsilon$. ☐

Theorem 6 is a type of minimax result, stating that, for any player strategy, a distribution of Bernoulli rewards exists for which the strategy incurs $\Omega(\log n)$ regret. The same UCB strategies of Auer et al. [33] may be used to obtain $O(\log n)$ upper bounds on the minimax regret even for the worst-case reward distribution, showing that the bound stated in Theorem 6 is tight.
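For readers who want to experiment with the UCB strategies mentioned above, the following is a minimal sketch of the UCB1 index rule of Auer et al. [33] on Bernoulli arms. The function name, the arm means, the horizon, and the seed are illustrative assumptions; the final quantity is a single-run estimate of the pseudo-regret based on the realized pull counts, not an exact expectation.

```python
import math
import random
from typing import List, Sequence


def ucb1_pull_counts(means: Sequence[float], n: int, seed: int = 0) -> List[int]:
    """Run UCB1 on Bernoulli arms for n rounds and return the pull counts T_j(n)."""
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k      # T_j: number of pulls of arm j so far
    sums = [0.0] * k      # cumulative reward collected from arm j
    for t in range(1, n + 1):
        if t <= k:
            arm = t - 1   # initialization: pull each arm once
        else:
            # index = empirical mean + confidence radius sqrt(2 log t / T_j)
            arm = max(
                range(k),
                key=lambda j: sums[j] / counts[j] + math.sqrt(2 * math.log(t) / counts[j]),
            )
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    return counts


means = [0.5, 0.6, 0.9]   # hypothetical Bernoulli means
n = 5000
counts = ucb1_pull_counts(means, n)
mu_star = max(means)
# Single-run estimate of the pseudo-regret via sum_j T_j(n) * Delta_j.
regret = sum(c * (mu_star - m) for c, m in zip(counts, means))
print(counts, regret)
```

Averaging `regret` over many seeds and plotting it against $\log n$ is one way to compare the empirical growth rate with the lower bounds discussed in this section.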

4.2. Adversarial Bandits

In the adversarial setting, we allow the reward distributions to vary arbitrarily over time. Thus, we assume that the reward distributions are chosen by an “adversary”, where the class of permissible adversarial strategies is denoted by $\mathcal{P}$. For a player strategy $S$ and an adversarial strategy $P \in \mathcal{P}$, we define the pseudo-regret analogously to the stochastic case:

$$\bar{R}_n(S, P) = \max_{1 \le j \le k} \mathbb{E}_{P}\left[\sum_{t=1}^{n} X_{j,t}\right] - \mathbb{E}_{S,P}\left[\sum_{t=1}^{n} X_{I_t, t}\right],$$

where the first expectation is taken with respect to possible randomization in the adversarial strategy, and the second expectation is taken with respect to randomization in the strategies of both the player and adversary.

The following result provides a lower bound for the minimax pseudo-regret, where the supremum is taken over $\mathcal{P}_{\mathrm{Ber}}$, the set of all Bernoulli reward distributions over the $k$ arms, and the infimum is taken over all player strategies [30,37]:

Theorem 7.

The minimax pseudo-regret satisfies the bound

$$\inf_{S \in \mathcal{S}} \sup_{P \in \mathcal{P}_{\mathrm{Ber}}} \bar{R}_n(S, P) \ge \frac{1}{18} \cdot \min\left\{\sqrt{nk}, \; n\right\},$$

where $\mathcal{S}$ denotes the set of all (possibly randomized) player strategies.

Proof (sketch).

Note that it suffices to prove the bound when the infimum is taken over deterministic player strategies, since the pseudo-regret for a randomized strategy will be a convex combination of the pseudo-regret of deterministic strategies. Fix a deterministic player strategy, and consider the reward distributions $P_1, \dots, P_k \in \mathcal{P}_{\mathrm{Ber}}$, where $P_j$ corresponds to the distribution where the reward of each arm $i \neq j$ is i.i.d. Bernoulli$(\frac{1}{2})$, and the reward of arm $j$ is i.i.d. Bernoulli$(\frac{1}{2} + \epsilon)$. Note that this construction bears some similarity to the proof outline for Theorem 6 provided above, in that the reward distribution $P_j$ slightly favors arm $j$. We will also compute a lower bound for the weighted regret, this time allocating uniform weights to each parameter setting, in order to conclude the existence of at least one assignment of reward distributions satisfying the desired lower bound. Let $\mathbb{E}_j$ denote the expectation with respect to the reward distribution $P_j$.

We may compute

$$\frac{1}{k} \sum_{j=1}^{k} \bar{R}_n(S, P_j) = \frac{1}{k} \sum_{j=1}^{k} \mathbb{E}_j\left[\sum_{i \neq j} \epsilon \, T_i(n)\right] = \frac{\epsilon}{k} \sum_{j=1}^{k} \mathbb{E}_j[n - T_j(n)] = \epsilon \left(n - \frac{1}{k} \sum_{j=1}^{k} \mathbb{E}_j[T_j(n)]\right), \tag{12}$$

where $T_i(n)$ denotes the number of pulls of arm $i$.

Let $P$ denote the reward distribution where all arms have a Bernoulli$(\frac{1}{2})$ distribution. We may obtain the following bound:

$$\mathbb{E}_j[T_j(n)] \overset{(a)}{\le} \mathbb{E}_P[T_j(n)] + n \sqrt{\frac{1}{2} D_{\mathrm{KL}}(P \,\|\, P_j)} \overset{(b)}{=} \mathbb{E}_P[T_j(n)] + \frac{n}{2} \sqrt{\log\left(\frac{1}{1 - 4\epsilon^2}\right) \mathbb{E}_P[T_j(n)]}, \tag{13}$$

where inequality (a) may be derived by first relating the difference in expectations for bounded random variables to total variation distance and then applying Pinsker’s inequality, and equality (b) follows from a direct computation. Combining equation (12) and inequality (13), and using the fact that $\sum_{j=1}^{k} \mathbb{E}_P[T_j(n)] = n$, we then obtain

$$\begin{aligned} \frac{1}{k} \sum_{j=1}^{k} \bar{R}_n(S, P_j) &\ge \epsilon \left(n - \frac{n}{k} - \frac{n}{2k} \sqrt{\log\left(\frac{1}{1 - 4\epsilon^2}\right)} \sum_{j=1}^{k} \sqrt{\mathbb{E}_P[T_j(n)]}\right) \\ &= \epsilon n \left(1 - \frac{1}{k} - \frac{1}{2} \sqrt{\log\left(\frac{1}{1 - 4\epsilon^2}\right)} \cdot \frac{1}{k} \sum_{j=1}^{k} \sqrt{\mathbb{E}_P[T_j(n)]}\right) \\ &\ge \epsilon n \left(1 - \frac{1}{k} - \frac{1}{2} \sqrt{\log\left(\frac{1}{1 - 4\epsilon^2}\right)} \sqrt{\frac{n}{k}}\right), \end{aligned}$$

using the concavity of the square root function. Choosing $\epsilon = \frac{1}{4n} \min\{\sqrt{kn}, \; n\}$ then yields the inequality

$$\sup_{P \in \mathcal{P}_{\mathrm{Ber}}} \bar{R}_n(S, P) \ge \frac{1}{k} \sum_{j=1}^{k} \bar{R}_n(S, P_j) \ge \frac{1}{18} \min\left\{\sqrt{kn}, \; n\right\},$$

and taking an infimum over all player strategies produces the desired result. ☐
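The two computational steps in this proof sketch, the “direct computation” behind equality (b) and the final choice of $\epsilon$, can be checked numerically. The sketch below is only a sanity check on a finite grid under the stated choice of $\epsilon$, not a proof; the function names and grid ranges are illustrative assumptions.

```python
import math


def kl_bernoulli(p: float, q: float) -> float:
    """KL divergence D_KL(Bernoulli(p) || Bernoulli(q)) for p, q in (0, 1)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def averaged_regret_bound(n: int, k: int) -> float:
    """Final expression of the proof: eps*n*(1 - 1/k - 0.5*sqrt(log(1/(1-4 eps^2)))*sqrt(n/k))."""
    eps = min(math.sqrt(k * n), n) / (4 * n)
    log_term = math.log(1 / (1 - 4 * eps ** 2))
    return eps * n * (1 - 1 / k - 0.5 * math.sqrt(log_term) * math.sqrt(n / k))


# Equality (b) rests on KL(Ber(1/2) || Ber(1/2 + eps)) = 0.5 * log(1 / (1 - 4 eps^2)),
# multiplied by the expected number of pulls of arm j under P.
eps = 0.1
kl_matches = math.isclose(kl_bernoulli(0.5, 0.5 + eps), 0.5 * math.log(1 / (1 - 4 * eps ** 2)))

# Compare with (1/18) * min(sqrt(k n), n) on a small grid of (n, k) with k >= 2.
grid_ok = all(
    averaged_regret_bound(n, k) >= min(math.sqrt(k * n), n) / 18
    for k in range(2, 51)
    for n in range(1, 2001, 7)
)
print(kl_matches, grid_ok)
```

The margin is smallest for $k = 2$, where the bracketed factor is close to $\frac{1}{4}$, which gives some intuition for why the constant in Theorem 7 cannot be improved much with this particular choice of $\epsilon$.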

Note that the lower bound provided in Theorem 7 clearly also holds when the supremum is taken over any class of adversarial strategies containing $\mathcal{P}_{\mathrm{Ber}}$. In particular, one topic of study is that of oblivious adversaries, which are allowed to perform any strategy that is non-adaptive to the actions of the player (i.e., the strategy is chosen before the start of the first round). The Exp3 algorithm provides an upper bound on the minimax pseudo-regret for oblivious adversaries that matches the lower bound in Theorem 7 up to a factor of $\sqrt{\log k}$ [37]. The study of non-oblivious adversaries refers to the setting where the adversary’s actions may also be chosen in response to the player’s sequential choices, and is also an active area of research [31,38].
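As an illustration of the Exp3 algorithm mentioned above, the following is a minimal sketch of the exponential-weights update with importance-weighted reward estimates, in the style of Auer et al. [37]. The function signature, the exploration rate `gamma = 0.1`, the weight rescaling step, and the toy adversary are assumptions made for illustration; rewards are assumed to lie in $[0, 1]$.

```python
import math
import random
from typing import Callable, List


def exp3(k: int, n: int, reward_fn: Callable[[int, int], float],
         gamma: float = 0.1, seed: int = 0) -> List[int]:
    """Exp3 with exploration rate gamma; rewards must lie in [0, 1].

    reward_fn(t, arm) returns the reward of `arm` at round t; only the
    reward of the arm actually played is queried (bandit feedback).
    """
    rng = random.Random(seed)
    weights = [1.0] * k
    actions = []
    for t in range(n):
        total = sum(weights)
        # Mix the exponential-weights distribution with uniform exploration.
        probs = [(1 - gamma) * w / total + gamma / k for w in weights]
        arm = rng.choices(range(k), weights=probs)[0]
        reward = reward_fn(t, arm)
        # Importance-weighted estimate of the played arm's reward.
        estimate = reward / probs[arm]
        weights[arm] *= math.exp(gamma * estimate / k)
        # Rescale for numerical stability; this leaves probs unchanged.
        largest = max(weights)
        weights = [w / largest for w in weights]
        actions.append(arm)
    return actions


# Hypothetical oblivious adversary: arm 1 pays 1 on even rounds, arm 0 always pays 0.3.
def rewards(t: int, arm: int) -> float:
    return (1.0 if t % 2 == 0 else 0.0) if arm == 1 else 0.3


actions = exp3(k=2, n=2000, reward_fn=rewards)
print(sum(a == 1 for a in actions) / len(actions))  # fraction of rounds spent on arm 1
```

The reward sequence here is fixed in advance and does not depend on the player’s actions, which is exactly the oblivious setting described in the text.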

5. Discussion

In this article, we have presented several distinct approaches for deriving lower bounds in various statistical learning problems. In each of the settings described—statistical estimation, community recovery, and online learning—we have shown how to simplify the problem to one involving channel decoding, and leverage information-theoretic bounds on the hardness of the decoding problem to bound the hardness of the corresponding statistical problem. It is worth reflecting on the similarities between the techniques employed in each of the approaches. Although the specific interpretation involving channel decoding looks quite different in each of the settings, the trick is to find an appropriate discretization of parameter space so that pairs of parameters are relatively far apart, but the corresponding data-generating distributions are close. In the context of statistical estimation, this means that we construct a packing of parameter space. In the community recovery setting, we consider pairs of community partitions that differ only in the assignment of a single node. In the multi-armed bandit setting, we consider pairs of arm parameters that flip the assignment of the optimal arm, while perturbing the parameter values as little as possible.

On a more applied note, information-theoretic tools have made an appearance in various machine learning algorithms that measure or optimize the dependence between observed quantities. Some examples include decision tree learning via information gain [39]; independent component analysis by mutual information minimization [40]; causal inference algorithms maximizing independence [41]; minimal-redundancy-maximal-relevance (mRMR) methods for feature selection [42]; and image registration via mutual information maximization in medical imaging [43]. As a result, quantities such as mutual information have become increasingly mainstream in data science applications. Note, however, that such applications of information theory to machine learning are largely distinct from the channel decoding techniques and hardness results discussed in this article. In terms of statistical theory, these applications have created a renewed interest in deriving efficient estimators of entropy and other related information measures based on finite samples [44,45,46,47], but a detailed discussion of such methods is somewhat orthogonal to the main topic of this survey.

Acknowledgments

The author thanks Varun Jog, the Assistant Editor, and the anonymous referees for helpful comments that enhanced the clarity of the paper.

Conflicts of Interest

The author declares no conflict of interest.

References

1. Bousquet, O.; Boucheron, S.; Lugosi, G. Introduction to statistical learning theory. In Advanced Lectures on Machine Learning; Springer: Berlin/Heidelberg, Germany, 2004; pp. 169–207.
2. Friedman, J.; Hastie, T.; Tibshirani, R. The Elements of Statistical Learning; Springer Series in Statistics; Springer: Berlin/Heidelberg, Germany, 2001; Volume 1.
3. Cover, T.M.; Thomas, J.A. Elements of Information Theory; John Wiley & Sons: New York, NY, USA, 2012.
4. Raskutti, G.; Wainwright, M.J.; Yu, B. Minimax rates of estimation for high-dimensional linear regression over ℓ_q-balls. IEEE Trans. Inf. Theory 2011, 57, 6976–6994.
5. Tsybakov, A.B. Introduction to Nonparametric Estimation; Springer: Berlin/Heidelberg, Germany, 2008.
6. Santhanam, N.P.; Wainwright, M.J. Information-theoretic limits of selecting binary graphical models in high dimensions. IEEE Trans. Inf. Theory 2012, 58, 4117–4134.
7. Guntuboyina, A. Lower bounds for the minimax risk using f-divergences, and applications. IEEE Trans. Inf. Theory 2011, 57, 2386–2399.
8. Cai, T.T.; Zhang, C.H.; Zhou, H.H. Optimal rates of convergence for covariance matrix estimation. Ann. Stat. 2010, 38, 2118–2144.
9. Amini, A.A.; Wainwright, M.J. High-Dimensional Analysis of Semidefinite Relaxations for Sparse Principal Components. Ann. Stat. 2009, 37, 2877–2921.
10. Lehmann, E.L.; Casella, G. Theory of Point Estimation; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2006.
11. Kühn, T. A lower estimate for entropy numbers. J. Approx. Theory 2001, 110, 120–124.
12. Zhang, C.H. Nearly unbiased variable selection under minimax concave penalty. Ann. Stat. 2010, 38, 894–942.
13. Yang, Y.; Barron, A. Information-theoretic determination of minimax rates of convergence. Ann. Stat. 1999, 27, 1564–1599.
14. Lorentz, G.G. Metric entropy and approximation. Bull. Am. Math. Soc. 1966, 72, 903–937.
15. Tikhomirov, V.M.; Shiryayev, A.N. ϵ-entropy and ϵ-capacity of sets in functional spaces. In Selected Works of A.N. Kolmogorov: Volume III: Information Theory and the Theory of Algorithms; Springer: Dordrecht, The Netherlands, 1993; pp. 86–170.
16. Stone, C.J. Optimal global rates of convergence for nonparametric regression. Ann. Stat. 1982, 10, 1040–1053.
17. Abbe, E. Community detection and stochastic block models: Recent developments. arXiv, 2017; arXiv:1703.10146.
18. Zhang, A.Y.; Zhou, H.H. Minimax rates of community detection in stochastic block models. Ann. Stat. 2016, 44, 2252–2280.
19. Gao, C.; Ma, Z.; Zhang, A.Y.; Zhou, H.H. Achieving optimal misclassification proportion in stochastic block model. arXiv, 2015; arXiv:1505.03772.
20. Xu, M.; Jog, V.; Loh, P. Optimal Rates for Community Estimation in the Weighted Stochastic Block Model. arXiv, 2017; arXiv:1706.01175.
21. Abbe, E.; Sandon, C. Community detection in general stochastic block models: Fundamental limits and efficient algorithms for recovery. In Proceedings of the 2015 IEEE 56th Annual Symposium on Foundations of Computer Science (FOCS), Berkeley, CA, USA, 17–20 October 2015; pp. 670–688.
22. Yun, S.Y.; Proutiere, A. Optimal cluster recovery in the labeled stochastic block model. In Advances in Neural Information Processing Systems, Proceedings of the 30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain, 4–9 December 2016; The Neural Information Processing Systems (NIPS) Foundation: La Jolla, CA, USA, 2016; pp. 965–973.
23. Bollobás, B. Random Graphs (Cambridge Studies in Advanced Mathematics); Cambridge University Press: Cambridge, UK, 2001.
24. Abbe, E.; Bandeira, A.S.; Hall, G. Exact recovery in the stochastic block model. IEEE Trans. Inf. Theory 2016, 62, 471–487.
25. Jog, V.; Loh, P. Recovering communities in weighted stochastic block models. In Proceedings of the 2015 53rd Annual Allerton Conference on Communication, Control, and Computing (Allerton), Monticello, IL, USA, 29 September–2 October 2015; pp. 1308–1315.
26. Abbe, E.; Montanari, A. Conditional random fields, planted constraint satisfaction and entropy concentration. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques; Springer: Berlin/Heidelberg, Germany, 2013; pp. 332–346.
27. Chen, Y.; Suh, C.; Goldsmith, A.J. Information recovery from pairwise measurements. IEEE Trans. Inf. Theory 2016, 62, 5881–5905.
28. Chen, Y.; Xu, J. Statistical-computational tradeoffs in planted problems and submatrix localization with a growing number of clusters and submatrices. J. Mach. Learn. Res. 2016, 17, 882–938.
29. Hajek, B.; Wu, Y.; Xu, J. Submatrix localization via message passing. arXiv, 2015; arXiv:1510.09219.
30. Bubeck, S.; Cesa-Bianchi, N. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Found. Trends Mach. Learn. 2012, 5, 1–122.
31. Cesa-Bianchi, N.; Lugosi, G. Prediction, Learning, and Games; Cambridge University Press: Cambridge, UK, 2006.
32. Lai, T.L.; Robbins, H. Asymptotically efficient adaptive allocation rules. Adv. Appl. Math. 1985, 6, 4–22.
33. Auer, P.; Cesa-Bianchi, N.; Fischer, P. Finite-time analysis of the multiarmed bandit problem. Mach. Learn. 2002, 47, 235–256.
34. Anthony, M.; Bartlett, P.L. Neural Network Learning: Theoretical Foundations; Cambridge University Press: Cambridge, UK, 1999.
35. Even-Dar, E.; Mannor, S.; Mansour, Y. PAC bounds for multi-armed bandit and Markov decision processes. In Proceedings of the Fifteenth Annual Conference on Computational Learning Theory, Sydney, Australia, 8–10 July 2002; Springer: Berlin, Germany, 2002; pp. 255–270.
36. Mannor, S.; Tsitsiklis, J.N. The sample complexity of exploration in the multi-armed bandit problem. J. Mach. Learn. Res. 2004, 5, 623–648.
37. Auer, P.; Cesa-Bianchi, N.; Freund, Y.; Schapire, R.E. The nonstochastic multiarmed bandit problem. SIAM J. Comput. 2002, 32, 48–77.
38. Maillard, O.; Munos, R. Adaptive bandits: Towards the best history-dependent strategy. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, Ft. Lauderdale, FL, USA, 11–13 April 2011; pp. 570–578.
39. Breiman, L.; Friedman, J.; Stone, C.J.; Olshen, R.A. Classification and Regression Trees; CRC Press: Boca Raton, FL, USA, 1984.
40. Hyvärinen, A.; Karhunen, J.; Oja, E. ICA by Minimization of Mutual Information. In Independent Component Analysis; John Wiley & Sons, Inc.: New York, NY, USA, 2002; pp. 221–227.
41. Janzing, D.; Mooij, J.; Zhang, K.; Lemeire, J.; Zscheischler, J.; Daniušis, P.; Steudel, B.; Schölkopf, B. Information-geometric approach to inferring causal directions. Artif. Intell. 2012, 182, 1–31.
42. Peng, H.; Long, F.; Ding, C. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans. Pattern Anal. Mach. Intell. 2005, 27, 1226–1238.
43. Maes, F.; Collignon, A.; Vandermeulen, D.; Marchal, G.; Suetens, P. Multimodality image registration by maximization of mutual information. IEEE Trans. Med. Imaging 1997, 16, 187–198.
44. Wolpert, D.H.; Wolf, D.R. Estimating functions of probability distributions from a finite set of samples. Phys. Rev. E 1995, 52, 6841.
45. Paninski, L. Estimation of entropy and mutual information. Neural Comput. 2003, 15, 1191–1253.
46. Valiant, G.; Valiant, P. Estimating the unseen: An n/log(n)-sample estimator for entropy and support size, shown optimal via new CLTs. In Proceedings of the Forty-Third Annual ACM Symposium on Theory of Computing, San Jose, CA, USA, 6–8 June 2011; pp. 685–694.
47. Jiao, J.; Venkat, K.; Han, Y.; Weissman, T. Minimax estimation of functionals of discrete distributions. IEEE Trans. Inf. Theory 2015, 61, 2835–2885.

© 2017 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Loh, P.-L. On Lower Bounds for Statistical Learning Theory. Entropy 2017, 19, 617. https://doi.org/10.3390/e19110617

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers.
