On Gap-Based Lower Bounding Techniques for Best-Arm Identification
Abstract
1. Introduction
2. Overview of Results
2.1. Problem Setup
- There are $M$ arms with Bernoulli rewards; the means are $\mu_1, \ldots, \mu_M$, and this set of means is said to define the bandit instance $\boldsymbol{\mu} = (\mu_1, \ldots, \mu_M)$. Our analysis will consider instances with arms sorted such that $\mu_1 \ge \mu_2 \ge \cdots \ge \mu_M$, without loss of generality.
- The agent would like to find an arm whose mean is within $\epsilon$ of the highest arm mean for some $\epsilon > 0$, i.e., an arm $i$ such that $\mu_i \ge \mu_1 - \epsilon$; such an arm is said to be $\epsilon$-optimal. Even if there are multiple such arms, identifying just one of them suffices.
- In each round, the agent can pull any arm $l$ and observe a reward $X_{l,s} \sim \mathrm{Bernoulli}(\mu_l)$, where $s$ is the number of times the $l$-th arm has been pulled so far. We assume that the rewards are independent, both across arms and across time.
- In each round, the agent can alternatively choose to terminate and output an arm index believed to be $\epsilon$-optimal. The time index at which this occurs is denoted by $T$, and is a random variable because it is allowed to depend on the rewards observed. We are interested in the expected number of arm pulls $\mathbb{E}[T]$ (also called the sample complexity) for a given instance $\boldsymbol{\mu}$, which should ideally be as low as possible.
- An algorithm is said to be $(\epsilon, \delta)$-PAC (Probably Approximately Correct) if, for all bandit instances, it outputs an $\epsilon$-optimal arm with probability at least $1 - \delta$ when it terminates at the stopping time $T$ (a minimal simulation sketch of this setup follows the list).
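To make the setup concrete, the following is a minimal simulation sketch, not taken from the paper: the names `pull` and `naive_pac` and the example instance are illustrative, and the strategy shown is the standard uniform-sampling baseline whose correctness follows from Hoeffding's inequality (the paper's subject is lower bounds, not this algorithm).

```python
import math
import random

def pull(mu, l):
    """Pull arm l of instance mu and observe a Bernoulli(mu[l]) reward."""
    return 1 if random.random() < mu[l] else 0

def naive_pac(mu, eps, delta):
    """Uniform-sampling (eps, delta)-PAC baseline.

    Each arm is pulled n times, with n chosen via Hoeffding's inequality so
    that each empirical mean lies within eps/2 of its true mean with
    probability at least 1 - delta/M; on that event, the empirically best
    arm is eps-optimal. Returns (output arm index, total number of pulls T).
    """
    M = len(mu)
    n = math.ceil((2.0 / eps**2) * math.log(2.0 * M / delta))
    means = [sum(pull(mu, l) for _ in range(n)) / n for l in range(M)]
    best = max(range(M), key=lambda l: means[l])
    return best, n * M  # for this non-adaptive strategy, T = n * M is deterministic

# Example instance: arm 0 is the unique best arm.
mu = [0.9, 0.8, 0.5, 0.5]
arm, T = naive_pac(mu, eps=0.1, delta=0.05)
print(f"output arm {arm} after T = {T} pulls")
```

This baseline uses $O\big((M/\epsilon^2)\log(M/\delta)\big)$ pulls on every instance; instance-dependent lower bounds of the kind studied here characterize how much better an adaptive algorithm can do on a given instance $\boldsymbol{\mu}$.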
2.2. Existing Lower Bounds
2.3. Our Result and Discussion
3. Proof of Theorem 1
4. Conclusion
Author Contributions
Funding
Conflicts of Interest
Appendix A. Proof of Lemma 1 (Constant-Probability Event for Small Enough […])
- (A30) uses the definitions of […] and […];
- (A33) uses the definitions of […] and […];
- (A34) follows from the definitions of […] and […] in (A23) and (A24) (which imply […]);
- (A35) follows from (A26);
- (A36) follows from Lemma A1 with […] and […];
- (A37) follows from (A29);
- (A38) follows from the definition of […];
- (A40) follows since the condition in […] yields […], which implies […];
- (A41) follows from the definitions of […] and […] in (25)–(26);
- (A42) follows from the definition of […] in (31);
- (A43) follows from the definition of […] in (15).
Appendix B. Proof of Proposition 1 (Bounding a Likelihood Ratio)
- Case 1: […]. In this case, recalling that […], we have […]. On the other hand, since […], we have […]. In addition, again using […], we have […].
- Case 2: […]. For this case, we have […]. From (A53), we have […]. For the third term in (A71), we proceed as follows: […]. On the other hand, observe that […]. Now, since […] (recall that we are in the case […]), by Lemma A2, we have […]. We now consider two further sub-cases:
- (i) […];
- (ii) […].
- If […], then we have […].
From (A93) and (A95), we obtain […].
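Since the symbolic expressions in the case analysis above were lost in extraction, the following generic identity may help orient the reader. It is the standard change-of-measure starting point for likelihood-ratio bounds of this kind (see, e.g., [15]), not necessarily the exact quantity manipulated in (A53)–(A95): for two Bernoulli instances $\boldsymbol{\mu}$ and $\boldsymbol{\mu}'$,

\[
\log \frac{\mathrm{d}\mathbb{P}_{\boldsymbol{\mu}}}{\mathrm{d}\mathbb{P}_{\boldsymbol{\mu}'}} = \sum_{l=1}^{M} \sum_{s=1}^{N_l(T)} \left( X_{l,s} \log \frac{\mu_l}{\mu'_l} + (1 - X_{l,s}) \log \frac{1 - \mu_l}{1 - \mu'_l} \right),
\]

where $N_l(T)$ is the number of pulls of arm $l$ up to the stopping time $T$. Taking expectations under $\boldsymbol{\mu}$ turns the inner sum for arm $l$ into $\mathbb{E}[N_l(T)]\, d(\mu_l, \mu'_l)$, with $d(\cdot, \cdot)$ the binary KL divergence; this is how bounds on the likelihood ratio translate into lower bounds on the expected number of pulls.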
Appendix C. Differences in Analysis Techniques
- We remove the restriction […] (or […]) used in the subsets […] and […] in Equations (4) and (5) of [14], so that our lower bound depends on all of the arms. To achieve this, our analysis frequently needs to handle the cases […] and […] separately (e.g., see the proof of Proposition 1).
- The preceding separation into two cases also introduces further difficulties. For example, our definition of […] in (30) is modified to contain different constants for the cases […] and […], which is not the case in Lemma 2 of [14]. Accordingly, the quantities […] in (27) and […] in (28) appear in our proof but not in [14].
- To further reduce the constant term from […] to […] (see Theorem 1), we also need to use other mathematical tricks to sharpen certain inequalities, such as (A83).
References
- Lattimore, T.; Szepesvári, C. Bandit Algorithms; Cambridge University Press: Cambridge, UK, 2020.
- Villar, S.S.; Bowden, J.; Wason, J. Multi-armed bandit models for the optimal design of clinical trials: Benefits and challenges. Stat. Sci. 2015, 30, 199–215.
- Li, L.; Chu, W.; Langford, J.; Schapire, R.E. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, Raleigh, NC, USA, 26–30 April 2010.
- Awerbuch, B.; Kleinberg, R.D. Adaptive routing with end-to-end feedback: Distributed learning and geometric approaches. In Proceedings of the Symposium on Theory of Computing (STOC 2004), Chicago, IL, USA, 5–8 June 2004.
- Shen, W.; Wang, J.; Jiang, Y.G.; Zha, H. Portfolio Choices with Orthogonal Bandit Learning. In Proceedings of the 24th International Joint Conference on Artificial Intelligence (IJCAI-15), Buenos Aires, Argentina, 25–31 July 2015.
- Bechhofer, R.E. A sequential multiple-decision procedure for selecting the best one of several normal populations with a common unknown variance, and its use with various experimental designs. Biometrics 1958, 14, 408–429.
- Paulson, E. A sequential procedure for selecting the population with the largest mean from k normal populations. Ann. Math. Stat. 1964, 35, 174–180.
- Even-Dar, E.; Mannor, S.; Mansour, Y. PAC bounds for multi-armed bandit and Markov decision processes. In Proceedings of the Fifteenth Annual Conference on Computational Learning Theory, Sydney, Australia, 8–10 July 2002.
- Kalyanakrishnan, S.; Tewari, A.; Auer, P.; Stone, P. PAC subset selection in stochastic multi-armed bandits. In Proceedings of the International Conference on Machine Learning, Edinburgh, UK, 26 June–1 July 2012.
- Gabillon, V.; Ghavamzadeh, M.; Lazaric, A. Best arm identification: A unified approach to fixed budget and fixed confidence. In Proceedings of the 26th Annual Conference on Neural Information Processing Systems, Lake Tahoe, NV, USA, 3–6 December 2012.
- Jamieson, K.; Malloy, M.; Nowak, R.; Bubeck, S. On finding the largest mean among many. arXiv 2013, arXiv:1306.3917.
- Karnin, Z.; Koren, T.; Somekh, O. Almost optimal exploration in multi-armed bandits. In Proceedings of the 30th International Conference on Machine Learning (ICML 2013), Atlanta, GA, USA, 16–21 June 2013.
- Jamieson, K.; Malloy, M.; Nowak, R.; Bubeck, S. lil’UCB: An Optimal Exploration Algorithm for Multi-Armed Bandits. arXiv 2013, arXiv:1312.7308.
- Mannor, S.; Tsitsiklis, J.N. The Sample Complexity of Exploration in the Multi-Armed Bandit Problem. J. Mach. Learn. Res. 2004, 5, 623–648.
- Kaufmann, E.; Cappé, O.; Garivier, A. On the Complexity of Best-arm Identification in Multi-armed Bandit Models. J. Mach. Learn. Res. 2016, 17, 1–42.
- Carpentier, A.; Locatelli, A. Tight (Lower) Bounds for the Fixed Budget Best Arm Identification Bandit Problem. In Proceedings of the Conference On Learning Theory, New York, NY, USA, 23–26 June 2016.
- Chen, L.; Li, J.; Qiao, M. Nearly Instance Optimal Sample Complexity Bounds for Top-k Arm Selection. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS 2017), Fort Lauderdale, FL, USA, 20–22 April 2017.
- Simchowitz, M.; Jamieson, K.G.; Recht, B. The Simulator: Understanding Adaptive Sampling in the Moderate-Confidence Regime. arXiv 2017, arXiv:1702.05186.
- Bubeck, S.; Cesa-Bianchi, N. Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems. In Foundations and Trends in Machine Learning; Now Publishers Inc.: Hanover, MA, USA, 2012; Volume 5.
- Royden, H.; Fitzpatrick, P. Real Analysis, 4th ed.; Pearson: New York, NY, USA, 2010.
- Katariya, S.; Jain, L.; Sengupta, N.; Evans, J.; Nowak, R. Adaptive Sampling for Coarse Ranking. In Proceedings of the 21st International Conference on Artificial Intelligence and Statistics (AISTATS 2018), Lanzarote, Spain, 9–11 April 2018.
- Billingsley, P. Probability and Measure, 3rd ed.; Wiley-Interscience: Hoboken, NJ, USA, 1995.
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).