Article

On Gap-Based Lower Bounding Techniques for Best-Arm Identification

Lan V. Truong and Jonathan Scarlett
1 Department of Engineering, University of Cambridge, Cambridge CB2 1PZ, UK
2 Department of Computer Science & Department of Mathematics, National University of Singapore, Singapore 117418, Singapore
* Author to whom correspondence should be addressed.
Entropy 2020, 22(7), 788; https://doi.org/10.3390/e22070788
Submission received: 16 June 2020 / Revised: 14 July 2020 / Accepted: 17 July 2020 / Published: 20 July 2020
(This article belongs to the Special Issue Information Theory in Machine Learning and Data Science II)

Abstract
In this paper, we consider techniques for establishing lower bounds on the number of arm pulls for best-arm identification in the multi-armed bandit problem. While a recent divergence-based approach was shown to provide improvements over an older gap-based approach, we show that the latter can be refined to match the former (up to constant factors) in many cases of interest under Bernoulli rewards, including the case that the rewards are bounded away from zero and one. Together with existing upper bounds, this indicates that the divergence-based and gap-based approaches are both effective for establishing sample complexity lower bounds for best-arm identification.

1. Introduction

The multi-armed bandit (MAB) problem [1] provides a versatile framework for sequentially searching for high-reward actions, with applications including clinical trials [2], online advertising [3], adaptive routing [4], and portfolio design [5]. The best-arm identification problem seeks to find the arm with the highest mean using as few arm pulls as possible, and dates back to the works of Bechhofer [6] and Paulson [7]. More recently, several algorithms have been proposed for best-arm identification, including successive elimination [8], lower-upper confidence bound algorithms [9,10], PRISM [11], and gap-based elimination [12]. The latter establishes a sample complexity that is known to be optimal in the two-arm case [13], and more generally near-optimal.
Complementary to these upper bounds are information-theoretic lower bounds on the performance of any algorithm. Such bounds serve as a means to assess the degree of optimality of practical algorithms and to identify where further improvements are possible, thus focusing research on directions that can have the greatest practical impact. Lower bounds were given by Mannor and Tsitsiklis [14] for Bernoulli bandits, and by Kaufmann et al. [15] for more general reward distributions. Both of these works were based on the difficulty of distinguishing bandit instances that differ in only a single arm distribution, but the subsequent analysis techniques differed significantly, with [14] using a direct change-of-measure analysis and introducing gap-based quantities equal to the difference between two arm means, and [15] using a form of the data processing inequality for KL divergence. We refer to these as the gap-based and divergence-based approaches, respectively. Further works on best-arm identification lower bounds include [16,17,18].
The divergence-based approach was shown in [15] to attain a stronger result than that of [14] with a simpler proof, as we outline in Section 2.2. In this paper, we address the question of whether the gap-based approach is fundamentally limited, or whether it can be refined to attain similar results to [15]. We show that the latter is the case in many scenarios of interest, via suitable refinements of the analysis of [14]. The existing results and our results are presented in Section 2, and our analysis is presented in Section 3.

2. Overview of Results

2.1. Problem Setup

We consider the following setup:
  • There are M arms with Bernoulli rewards; the means are $\mathbf{p} = (p_1, p_2, \ldots, p_M)$, and this set of means is said to define the bandit instance. Our analysis will consider instances with arms sorted such that $p_1 \ge p_2 \ge \cdots \ge p_M$, without loss of generality.
  • The agent would like to find an arm whose mean is within $\varepsilon$ of the highest arm mean for some $0 < \varepsilon < 1$, i.e., an arm $l$ with $p_l > p_1 - \varepsilon$. If multiple such arms exist, identifying any one of them suffices.
  • In each round, the agent can pull any arm $l \in [M]$ and observe a reward $X_l(s) \sim \mathrm{Bernoulli}(p_l)$, where $s$ is the number of times the $l$-th arm has been pulled so far. We assume that the rewards are independent, both across arms and across time.
  • In each round, the agent can alternatively choose to terminate and output an arm index $\hat{l}$ believed to be $\varepsilon$-optimal. The round at which this occurs is denoted by $T$, and is a random variable because it is allowed to depend on the rewards observed. We are interested in the expected number of arm pulls (also called the sample complexity) $\mathbb{E}_{\mathbf{p}}[T]$ for a given instance $\mathbf{p}$, which should ideally be as low as possible.
  • An algorithm is said to be $(\varepsilon,\delta)$-PAC (Probably Approximately Correct) if, for all bandit instances, it outputs an $\varepsilon$-optimal arm with probability at least $1-\delta$ when it terminates at the stopping time $T$. (A minimal code sketch of this interaction model is given below.)
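To make the interaction model concrete, the following is a minimal simulation sketch of the setup above (the class and method names are our own illustrative choices, not part of the paper):

```python
import numpy as np

class BernoulliBandit:
    """Minimal Bernoulli bandit environment matching the setup above."""

    def __init__(self, means, seed=None):
        self.means = np.asarray(means, dtype=float)        # p = (p_1, ..., p_M)
        self.rng = np.random.default_rng(seed)
        self.pulls = np.zeros(len(self.means), dtype=int)  # per-arm pull counters T_l

    def pull(self, l):
        """Pull arm l (0-indexed) and return an independent Bernoulli(p_l) reward."""
        self.pulls[l] += 1
        return int(self.rng.random() < self.means[l])

# Example: pull each arm of a 3-armed instance once.
bandit = BernoulliBandit([0.7, 0.6, 0.3], seed=0)
rewards = [bandit.pull(l) for l in range(3)]
```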
We will frequently make use of some fundamental quantities. First, the best arm mean and the gap to the best arm are denoted by
$$p^* := p_1, \qquad (1)$$
$$\Delta_l := p^* - p_l. \qquad (2)$$
The set of $\varepsilon$-optimal arms and the set of $\varepsilon$-suboptimal arms are respectively given by
$$M(\mathbf{p},\varepsilon) := \{l \in [M] : p_l > p^* - \varepsilon\}, \qquad (3)$$
$$N(\mathbf{p},\varepsilon) := \{l \in [M] : p_l \le p^* - \varepsilon\}, \qquad (4)$$
and we make use of the binary KL divergence function
$$\mathrm{KL}(p,q) := p\log\frac{p}{q} + (1-p)\log\frac{1-p}{1-q}, \qquad (5)$$
where here and subsequently, $\log(\cdot)$ denotes the natural logarithm.
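These quantities are straightforward to compute; the sketch below (our own illustration, with hypothetical helper names) evaluates (1)–(5) for a given instance:

```python
import math

def binary_kl(p, q):
    """Binary KL divergence KL(p, q) in nats, as in (5); requires p, q in (0, 1)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def gap_quantities(p, eps):
    """Return the gaps Delta_l and the sets M(p, eps), N(p, eps) from (1)-(4)."""
    p_star = max(p)                                  # best arm mean p* (= p_1 when sorted)
    delta = [p_star - pl for pl in p]                # gaps Delta_l = p* - p_l
    M_set = [l for l, pl in enumerate(p) if pl > p_star - eps]   # eps-optimal arms
    N_set = [l for l, pl in enumerate(p) if pl <= p_star - eps]  # eps-suboptimal arms
    return delta, M_set, N_set

delta, M_set, N_set = gap_quantities([0.7, 0.6, 0.3], eps=0.2)
# delta = [0.0, 0.1, 0.4], M_set = [0, 1], N_set = [2]
```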

2.2. Existing Lower Bounds

For any fixed $\underline{p} \in (0, 1/2)$, Mannor and Tsitsiklis [14] showed that if an algorithm is $(\varepsilon,\delta)$-PAC with respect to all instances with $\min_l p_l \ge \underline{p} > 0$, and if $\varepsilon \le \frac{1-p^*}{4}$ and $\delta \le e^{-8}/8$, then for any constant $\alpha \in (0,2)$, there exists $c_1 = O(\underline{p}^2)$ (depending on $\alpha$) such that
$$\mathbb{E}_{\mathbf{p}}[T] \ge c_1\left(\frac{\big(|\tilde{M}(\mathbf{p},\varepsilon)|-1\big)^+}{\varepsilon^2} + \sum_{l \in \tilde{N}(\mathbf{p},\varepsilon)}\frac{1}{\Delta_l^2}\right)\log\frac{1}{8\delta}, \qquad (6)$$
where
$$\tilde{M}(\mathbf{p},\varepsilon) = M(\mathbf{p},\varepsilon) \cap \Big\{l \in [M] : p_l \ge \frac{\varepsilon+p^*}{2-\alpha}\Big\}, \qquad (7)$$
$$\tilde{N}(\mathbf{p},\varepsilon) = N(\mathbf{p},\varepsilon) \cap \Big\{l \in [M] : p_l \ge \frac{\varepsilon+p^*}{2-\alpha}\Big\}. \qquad (8)$$
Note that the subsets $\tilde{M}(\mathbf{p},\varepsilon)$ and $\tilde{N}(\mathbf{p},\varepsilon)$ do not always form a partition of the arms, i.e., it may hold that $\tilde{M}(\mathbf{p},\varepsilon) \cup \tilde{N}(\mathbf{p},\varepsilon) \ne [M]$. The sets increase in size as $\alpha$ decreases, but implicitly this leads to a lower value of $c_1$. In addition, as we will see below, the $\underline{p}^2$ dependence entering via $c_1$ is not necessary.
We also note that the lower bound in (6) depends on the instance-specific quantities $\tilde{M}(\mathbf{p},\varepsilon)$, $\tilde{N}(\mathbf{p},\varepsilon)$, and $\Delta_l$, and is thus an instance-dependent bound. On the other hand, the lower bound is only stated for $(\varepsilon,\delta)$-PAC algorithms, and the PAC guarantee requires the algorithm to eventually succeed on any instance (subject to the assumptions given on $p_l$, $\varepsilon$, and $\delta$).
Kaufmann et al. [15] improved Mannor and Tsitsiklis's lower bound by using a form of the data processing inequality for KL divergence, leading to the following whenever $\delta \le 0.15$ and $0 < \varepsilon < \min\{p^*, 1-p^*\}$ [15] (Remark 5):
$$\mathbb{E}_{\mathbf{p}}[T] \ge \left(\frac{|M(\mathbf{p},\varepsilon)|-1}{\mathrm{KL}(p^*-\varepsilon,\,p^*+\varepsilon)} + \sum_{l \in N(\mathbf{p},\varepsilon)}\frac{1}{\mathrm{KL}(p_l,\,p^*+\varepsilon)}\right)\log\frac{1}{2.4\delta}. \qquad (9)$$
To directly compare this result with (6), it is useful to apply the following inequality [19] (Equation (2.8)):
$$2(p-q)^2 \le \mathrm{KL}(p,q) \le \frac{(p-q)^2}{q(1-q)}, \qquad (10)$$
which yields
$$\mathbb{E}_{\mathbf{p}}[T] \ge (p^*+\varepsilon)(1-p^*-\varepsilon)\left(\frac{|M(\mathbf{p},\varepsilon)|-1}{4\varepsilon^2} + \sum_{l \in N(\mathbf{p},\varepsilon)}\frac{1}{(\varepsilon+\Delta_l)^2}\right)\log\frac{1}{2.4\delta}. \qquad (11)$$
Even this weakened bound can significantly improve on (6), since (i) $M(\mathbf{p},\varepsilon) \supseteq \tilde{M}(\mathbf{p},\varepsilon)$ and $N(\mathbf{p},\varepsilon) \supseteq \tilde{N}(\mathbf{p},\varepsilon)$; (ii) the $\underline{p}^2$ dependence is replaced by $(p^*+\varepsilon)(1-p^*-\varepsilon)$, so the dependence on the smallest arm mean is avoided (the $1-p^*-\varepsilon$ term is potentially small when $\varepsilon$ is close to $1-p^*$, but since (6) assumes $\varepsilon \le \frac{1-p^*}{4}$, we can still say that (11) is at least as good as (6)); and (iii) the assumption $\varepsilon \le \frac{1-p^*}{4}$ is avoided.
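As a quick numerical sanity check of (10) (our own illustration, not from the paper), both inequalities can be verified on randomly drawn pairs:

```python
import math
import random

def binary_kl(p, q):
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

random.seed(0)
for _ in range(100000):
    p, q = random.uniform(0.01, 0.99), random.uniform(0.01, 0.99)
    kl = binary_kl(p, q)
    assert 2 * (p - q) ** 2 <= kl + 1e-12                 # left inequality in (10)
    assert kl <= (p - q) ** 2 / (q * (1 - q)) + 1e-12     # right inequality in (10)
```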

2.3. Our Result and Discussion

Our lower bound, stated in the following theorem, is developed based on Mannor and Tsitsiklis’s analysis for best-arm identification [14] (Theorem 1), but uses novel refinements of the techniques therein to further optimize the bound (see Appendix C for an overview of these refinements).
Theorem 1.
For any bandit instance $\mathbf{p} \in (0, p^*]^M$ with $p^* \in (0,1)$, and any $(\varepsilon,\delta)$-PAC algorithm with $0 < \varepsilon < 1-p^*$ and $0 < \delta < \delta_0$ for some $\delta_0 < 1/4$, we have
$$\mathbb{E}_{\mathbf{p}}[T] \ge \frac{2\gamma_0(p^*+\varepsilon)(1-p^*-\varepsilon)}{7(\xi+1)}\left(\frac{|M(\mathbf{p},\varepsilon)|-1}{4\varepsilon^2} + \sum_{l \in N(\mathbf{p},\varepsilon)}\frac{1}{(\varepsilon+\Delta_l)^2}\right)\log\frac{1+4\delta_0}{4\delta}, \qquad (12)$$
where
$$\gamma_0 = \frac{1-4\delta_0}{8}, \qquad (13)$$
$$\theta = \frac{2\delta}{1-4\gamma_0} = \frac{4\delta}{1+4\delta_0}, \qquad (14)$$
and $\xi > 0$ is the unique positive solution of the following quadratic equation:
$$7\gamma_0\,\xi^2\log\frac{1}{\theta} = 3(\xi+1). \qquad (15)$$
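As a sanity check, the bound (12) can be evaluated numerically. The following sketch (our own hypothetical helper, under the theorem's assumptions on $\varepsilon$, $\delta$, and $\delta_0$) solves the quadratic (15) in closed form and assembles the right-hand side of (12):

```python
import math

def theorem1_bound(p, eps, delta, delta0=0.2):
    """Evaluate the right-hand side of (12) for a Bernoulli instance p."""
    p_star = max(p)
    gamma0 = (1 - 4 * delta0) / 8                        # (13)
    log_term = math.log((1 + 4 * delta0) / (4 * delta))  # log(1/theta), see (14)
    # Unique positive root of 7*gamma0*xi^2*log(1/theta) = 3*xi + 3, per (15).
    a = 7 * gamma0 * log_term
    xi = (3 + math.sqrt(9 + 12 * a)) / (2 * a)
    M = [l for l, pl in enumerate(p) if pl > p_star - eps]
    N = [l for l, pl in enumerate(p) if pl <= p_star - eps]
    lead = 2 * gamma0 * (p_star + eps) * (1 - p_star - eps) / (7 * (xi + 1))
    inner = (len(M) - 1) / (4 * eps**2) + sum(
        1.0 / (eps + p_star - p[l]) ** 2 for l in N)
    return lead * inner * log_term

print(theorem1_bound([0.7, 0.6, 0.3], eps=0.05, delta=0.01))
```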
Observe that this result matches (11) (with modified constants), and therefore exhibits the above benefit of depending on the full sets $M(\mathbf{p},\varepsilon)$ and $N(\mathbf{p},\varepsilon)$ without the condition $p_l \ge \frac{\varepsilon+p^*}{2-\alpha}$ (see (7)–(8)), as well as avoiding the dependence on $\underline{p}$, and permitting the broadest range of $\varepsilon$ and $\delta$ among the above results.
The result (11) in turn matches (9) whenever the right-hand inequality in (10) is tight (i.e., whenever $\mathrm{KL}(p,q) = \Theta\big(\frac{(p-q)^2}{q(1-q)}\big)$). This is clearly true when $p$ and $q$ (representing the arm means) are bounded away from zero and one, and also in certain limiting cases approaching these endpoints (e.g., when $p$ and $q$ both tend to one with $\frac{1-p}{1-q} = \Theta(1)$). However, there are also limiting cases where the upper bound in (10) is not tight (e.g., $p = 1-\eta$ and $q = 1-\eta^2$ as $\eta \to 0$), and in such cases, the bound (9) remains tighter than that of Theorem 1.

3. Proof of Theorem 1

We follow the general steps of [14] (Theorem 5), but with several refinements to improve the final bound. The main differences are outlined in Appendix C.
Step 1: Defining a Hypothesis Test
Let us denote the true (unknown) expected reward of arm $l$ by $Q_l$ for each $l \in [M]$. Similarly to [14,15], we consider $M$ hypotheses as follows:
$$H_1: \quad Q_l = p_l, \quad \forall l \in [M], \qquad (16)$$
and for each $l \ne 1$,
$$H_l: \quad Q_l = p^*+\varepsilon, \qquad Q_{l'} = p_{l'}, \quad \forall l' \in [M] \setminus \{l\}. \qquad (17)$$
If hypothesis $H_l$ is true, the $(\varepsilon,\delta)$-PAC algorithm must return arm $l$ with probability at least $1-\delta$. We will bound the sample complexity under hypothesis $H_1$. We denote by $\mathbb{E}_l$ and $P_l$ the expectation and probability, respectively, under hypothesis $H_l$.
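The construction of the hypotheses in (16)–(17) is illustrated below (a small sketch of our own; the function name is hypothetical):

```python
def build_hypotheses(p, eps):
    """Mean vectors under H_1, ..., H_M per (16)-(17): H_l raises arm l to p* + eps."""
    p_star = max(p)
    instances = {1: list(p)}            # H_1: the true instance p
    for l in range(2, len(p) + 1):
        q = list(p)
        q[l - 1] = p_star + eps         # under H_l, arm l becomes the unique best arm
        instances[l] = q
    return instances

# For p = (0.7, 0.6, 0.3) and eps = 0.05:
# H_2 has means (0.7, 0.75, 0.3) and H_3 has means (0.7, 0.6, 0.75).
hypotheses = build_hypotheses([0.7, 0.6, 0.3], 0.05)
```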
Let $B_l$ be the event that the algorithm returns arm $l$. Since $\sum_{l \in M(\mathbf{p},\varepsilon)} P_1(B_l) \le 1$ and $|M(\mathbf{p},\varepsilon)| \ge 1$, there is at most one arm $l_0 \in M(\mathbf{p},\varepsilon)$ that satisfies $P_1(B_{l_0}) > \frac12$. Defining
$$M_0(\mathbf{p},\varepsilon) := \Big\{l \in M(\mathbf{p},\varepsilon) : P_1[B_l] \le \frac12\Big\} = \big\{l \in M(\mathbf{p},\varepsilon) : l \ne l_0\big\}, \qquad (18)$$
it follows that
$$|M_0(\mathbf{p},\varepsilon)| \ge \big(|M(\mathbf{p},\varepsilon)|-1\big)^+. \qquad (19)$$
Define
$$T(\mathbf{p},\varepsilon) := M_0(\mathbf{p},\varepsilon) \cup N(\mathbf{p},\varepsilon), \qquad (20)$$
as well as
$$B_{M(\mathbf{p},\varepsilon)} := \bigcup_{l \in M(\mathbf{p},\varepsilon)} B_l, \qquad (21)$$
which is the event that the policy eventually selects an arm in the $\varepsilon$-neighborhood of the best arm in $[M]$. Since the policy is $(\varepsilon,\delta)$-correct with $\delta < \delta_0$, we must have
$$P_1[B_{M(\mathbf{p},\varepsilon)}] \ge 1-\delta > 1-\delta_0, \qquad (22)$$
and it follows from (18) and (22) that
$$P_1[B_l] \le \max\{\delta_0, 1/2\} \qquad (23)$$
$$= \frac12 \qquad (24)$$
for all $l \in T(\mathbf{p},\varepsilon)$.
Step 2: Bounding the Number of Pulls of Each Arm
Before proceeding, we make some additional definitions:
$$\alpha_l := \frac{\varepsilon+\Delta_l}{(1-p_l)\,p_l}, \qquad (25)$$
$$\beta_l := \alpha_l\,\frac{1-p_l}{1-(p^*+\varepsilon)}, \qquad (26)$$
$$\tilde{\alpha}_l := \alpha_l - \frac43\big(\alpha_l(1-p_l)\big)^2, \qquad (27)$$
$$\tilde{\beta}_l := \beta_l - \frac43\big(\alpha_l(1-p_l)\big)^2. \qquad (28)$$
The definitions (27) and (28) will only be used for arms with $\frac{\varepsilon+\Delta_l}{p_l} \le \frac12$, and for such arms, we will establish in the analysis that $\tilde{\alpha}_l > 0$ and $\tilde{\beta}_l > 0$.
We prove the following lemma, characterizing the probability of a certain event in which (i) the number of pulls of some arm $l \in T(\mathbf{p},\varepsilon)$ falls below a suitable threshold (event $A_l$ below), (ii) a deviation bound holds regarding the number of observed 1's from pulling arm $l$ (event $C_l$ below), and (iii) arm $l$ is not returned (event $B_l^c$).
Lemma 1.
For each $l \in [M]$, let $T_l$ be the total number of times that arm $l$ is pulled under the $(\varepsilon,\delta)$-correct policy. Let $K_l = X_l(1) + X_l(2) + \cdots + X_l(T_l)$ be the total number of unit rewards obtained from pulling arm $l$ up to the $T_l$-th time. Let
$$G_{1,l} := \frac{7\,p_l^2\,\alpha_l^2\,(1-p_l)^2}{2(p^*+\varepsilon)(1-p_l)\big(1-(p^*+\varepsilon)\big)}\,T_l, \qquad (29)$$
$$G_{2,l} := \Big[\tilde{\beta}_l(p_lT_l - K_l)\mathbb{1}\{p_lT_l > K_l\} + \tilde{\alpha}_l(p_lT_l - K_l)\mathbb{1}\{p_lT_l \le K_l\}\Big]\,\mathbb{1}\Big\{\frac{\varepsilon+\Delta_l}{p_l} \le \frac12\Big\}, \qquad (30)$$
where $\alpha_l$, $\tilde{\alpha}_l$, and $\tilde{\beta}_l$ are defined in (25), (27), and (28), respectively. Let
$$\nu_l := (\xi+1)\,\frac{1-p_l}{1-p^*-\varepsilon}, \qquad (31)$$
where $\xi$ is defined in (15). Define the following events:
$$A_l := \Big\{G_{1,l} \le \frac{1}{\nu_l}\log\frac{1}{\theta}\Big\}, \qquad (32)$$
$$C_l := \Big\{G_{2,l} \le \frac{\xi}{\nu_l}\cdot\frac{1-p_l}{1-p^*-\varepsilon}\log\frac{1}{\theta}\Big\}, \qquad (33)$$
$$S_l := A_l \cap B_l^c \cap C_l. \qquad (34)$$
If $l \in T(\mathbf{p},\varepsilon)$ (see (20)), then under the condition that
$$\mathbb{E}_1[G_{1,l}] < \frac{\gamma_0}{\nu_l}\log\frac{1}{\theta}, \qquad (35)$$
we have
$$P_1(S_l) > \frac{1-4\gamma_0}{2}. \qquad (36)$$
Proof. 
See Appendix A. □
Intuitively, $A_l$ is the event that the total number of times arm $l$ is pulled is small, and $C_l$ is the event that $|p_lT_l - K_l|$ is not too large (since pulling an arm $T_l$ times should produce roughly $p_lT_l$ ones). The lemma indicates that if $\mathbb{E}_1[T_l]$ is not too large, then $P_1[A_l \cap B_l^c \cap C_l]$ is lower bounded, and this will ultimately lead to a lower bound on $P_l[B_l^c]$, the quantity of primary interest.
In Lemma 2 below, we will use Lemma 1 to deduce a lower bound on $\mathbb{E}_1[G_{1,l}]$, which amounts to a lower bound on the average number of arm pulls by the definition of $G_{1,l}$. Before doing so, we introduce a likelihood ratio that will be used in a change-of-measure argument [14].
For any given time $t \ge 1$ and $l \in [M]$, let $T_l(t)$ be the total number of times that arm $l$ is pulled by time $t$. Define
$$X_l^{T_l(t)} := \big\{X_l(1), X_l(2), \ldots, X_l(T_l(t))\big\}, \qquad (37)$$
and let
$$\mathcal{F}_t := \sigma\big(X_1^{T_1(t)}, X_2^{T_2(t)}, \ldots, X_M^{T_M(t)}\big) \qquad (38)$$
be the $\sigma$-algebra generated by $X_1^{T_1(t)}, X_2^{T_2(t)}, \ldots, X_M^{T_M(t)}$ for all $t = 1, 2, \ldots$.
Recall that $T$ is the stopping time of the algorithm, and that $T_l := T_l(T)$ for all $l \in [M]$. Moreover, let $W = \mathcal{F}_T$ be the entire history up to the stopping time $T$. We define the following likelihood ratio:
$$L_l(w) = \frac{P_l(W = w)}{P_1(W = w)} \qquad (39)$$
for every possible history $w$; moreover, we let $L_l(W)$ denote the corresponding random variable. Given the history up to time $T-1$ (i.e., $\mathcal{F}_{T-1}$), the arm reward at time $T$ has the same probability distribution under $H_1$ and $H_l$ unless the chosen arm is arm $l$. Therefore, we have
$$L_l(W) = \frac{(p^*+\varepsilon)^{K_l}\,(1-p^*-\varepsilon)^{T_l-K_l}}{p_l^{K_l}\,(1-p_l)^{T_l-K_l}}, \qquad (40)$$
where $K_l := X_l(1) + X_l(2) + \cdots + X_l(T_l)$ (i.e., the total number of 1's in the $T_l$ pulls of arm $l$).
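Since (40) depends on the history only through $T_l$ and $K_l$, it is easy to compute; the sketch below (our own illustration) evaluates it in log-space for numerical stability:

```python
import math

def log_likelihood_ratio(p_l, p_star, eps, T_l, K_l):
    """log L_l(W) per (40): K_l ones and T_l - K_l zeros observed on arm l,
    whose mean is p_l under H_1 and p* + eps under H_l."""
    q = p_star + eps
    return (K_l * math.log(q / p_l)
            + (T_l - K_l) * math.log((1 - q) / (1 - p_l)))

# Example: p_l = 0.6 pulled 100 times with 58 ones, p* = 0.7, eps = 0.05.
L = math.exp(log_likelihood_ratio(0.6, 0.7, 0.05, 100, 58))
```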
The following proposition presents one of our key technical results towards establishing the lower bound. We use the definitions in (1)–(5), along with (25)–(28).
Proposition 1.
Fix the bandit instance $\mathbf{p}$, the parameter $0 < \varepsilon < 1-p^*$, and the history $W$ with corresponding values $K_l$ and $T_l$. Recalling the definitions of $\alpha_l$, $\tilde{\alpha}_l$, and $\tilde{\beta}_l$ in (25), (27), and (28), respectively, we have
$$L_l(W) \ge \exp\big(-G_{1,l} - G_{2,l}\big), \qquad (41)$$
where $G_{1,l}$ and $G_{2,l}$ are defined in (29)–(30).
Proof. 
See Appendix B. □
Based on Lemma 1 and Proposition 1, we obtain the following extension of [14] (Lemma 6), lower bounding the average of each $G_{1,l}$; this lower bound will later be translated into a lower bound on the number of arm pulls $T_l$.
Lemma 2.
For any arm $l \in T(\mathbf{p},\varepsilon)$, the following holds:
$$\mathbb{E}_1[G_{1,l}] \ge \frac{\gamma_0}{\nu_l}\log\frac{1}{\theta}, \qquad (42)$$
where $\theta$ and $\nu_l$ are defined in (14) and (31), respectively.
Proof. 
We use a proof by contradiction. Suppose that
$$\mathbb{E}_1[G_{1,l}] < \frac{\gamma_0}{\nu_l}\log\frac{1}{\theta}; \qquad (43)$$
then, by Lemma 1, Equation (36) holds. Moreover, by Proposition 1, we have
$$L_l(W) \ge \exp\big(-G_{1,l}-G_{2,l}\big), \qquad (44)$$
and recalling the definition of $S_l$ in (34), it follows from (44) that
$$L_l(W)\,\mathbb{1}_{S_l} \ge \exp\big(-G_{1,l}-G_{2,l}\big)\,\mathbb{1}_{S_l} \qquad (45)$$
$$\ge \exp\bigg(-\frac{1}{\nu_l}\Big(\xi\,\frac{1-p_l}{1-p^*-\varepsilon}+1\Big)\log\frac{1}{\theta}\bigg)\,\mathbb{1}_{S_l} \qquad (46)$$
$$\ge \exp\bigg(-\frac{1}{\nu_l}(\xi+1)\,\frac{1-p_l}{1-p^*-\varepsilon}\log\frac{1}{\theta}\bigg)\,\mathbb{1}_{S_l}, \qquad (47)$$
where (46) follows from the definitions in (32)–(33), and (47) follows from the fact that $1-p_l \ge 1-p^* \ge 1-p^*-\varepsilon$ for all $l \in [M]$.
By the choice of $\nu_l > 0$ given in (31), it holds that
$$(\xi+1)\,\frac{1-p_l}{1-p^*-\varepsilon}\cdot\frac{1}{\nu_l} = 1. \qquad (48)$$
Hence, from (47) and (48), we have
$$L_l(W)\,\mathbb{1}_{S_l} \ge \theta\,\mathbb{1}_{S_l} = \frac{2\delta}{1-4\gamma_0}\,\mathbb{1}_{S_l} \qquad (49)$$
for all $l \in T(\mathbf{p},\varepsilon)$, by the definition of $\theta$ in (14).
We are now ready to complete the proof:
$$P_l[B_l^c] \ge P_l[S_l] \qquad (50)$$
$$= \mathbb{E}_l[\mathbb{1}_{S_l}] \qquad (51)$$
$$= \mathbb{E}_1\big[L_l(W)\,\mathbb{1}_{S_l}\big] \qquad (52)$$
$$\ge \mathbb{E}_1\bigg[\frac{2\delta}{1-4\gamma_0}\,\mathbb{1}_{S_l}\bigg] \qquad (53)$$
$$= \frac{2\delta}{1-4\gamma_0}\,P_1[S_l] \qquad (54)$$
$$> \frac{2\delta}{1-4\gamma_0}\cdot\frac{1-4\gamma_0}{2} \qquad (55)$$
$$= \delta, \qquad (56)$$
where (50) follows from the definition of the set $S_l$ in (34) (which implies $S_l \subseteq B_l^c$), (52) follows by a standard change of measure [20], (53) follows from (49), and (55) follows from (36) of Lemma 1 (recall that we assumed (43)).
The inequality (56) contradicts the fact that under $H_l$, the $(\varepsilon,\delta)$-correct bandit policy must return arm $l$ with probability at least $1-\delta$, i.e., $P_l(B_l^c) \le \delta$. This concludes the proof. □
From Lemma 2 and the definition of $G_{1,l}$ in (29), it holds that
$$\mathbb{E}_1\bigg[\frac{7\,\alpha_l^2p_l^2(1-p_l)^2}{2(p^*+\varepsilon)(1-p_l)\big(1-(p^*+\varepsilon)\big)}\,T_l\bigg] \ge \frac{\gamma_0}{\nu_l}\log\frac{1}{\theta} \qquad (57)$$
for all $l \in T(\mathbf{p},\varepsilon)$. Hence, using the definition of $\nu_l$ in (31), we have
$$\mathbb{E}_1\big[\alpha_l^2p_l^2(1-p_l)^2\,T_l\big] \ge \frac{2\gamma_0(p^*+\varepsilon)(1-p_l)(1-p^*-\varepsilon)}{7(1+\xi)}\cdot\frac{1-p^*-\varepsilon}{1-p_l}\,\log\frac{1}{\theta} \qquad (58)$$
$$= \frac{2\gamma_0(p^*+\varepsilon)(1-p^*-\varepsilon)}{7(1+\xi)}\,\log\frac{1}{\theta}. \qquad (59)$$
Step 3: Deducing a Lower Bound on the Sample Complexity
For any arm $l \in T(\mathbf{p},\varepsilon)$, by the definition of $\alpha_l$ in (25), we have
$$\alpha_l^2p_l^2(1-p_l)^2 = (\varepsilon+\Delta_l)^2. \qquad (60)$$
Note that $0 \le \Delta_l < \varepsilon$ for all $l \in M_0(\mathbf{p},\varepsilon)$, since $M_0(\mathbf{p},\varepsilon) \subseteq M(\mathbf{p},\varepsilon)$, the set of $\varepsilon$-optimal arms. Therefore, we can further bound (60) as
$$\alpha_l^2p_l^2(1-p_l)^2 \le 4\varepsilon^2 \qquad (61)$$
for $l \in M_0(\mathbf{p},\varepsilon)$.
Substituting (60)–(61) into (59), we obtain
$$\mathbb{E}_{\mathbf{p}}[T] = \mathbb{E}_{\mathbf{p}}\bigg[\sum_{l=1}^M T_l\bigg] \qquad (62)$$
$$\ge \mathbb{E}_{\mathbf{p}}\bigg[\sum_{l \in M_0(\mathbf{p},\varepsilon)} T_l\bigg] + \mathbb{E}_{\mathbf{p}}\bigg[\sum_{l \in N(\mathbf{p},\varepsilon)} T_l\bigg] \qquad (63)$$
$$\ge \frac{2\gamma_0(p^*+\varepsilon)(1-p^*-\varepsilon)}{7(\xi+1)}\bigg(\frac{|M_0(\mathbf{p},\varepsilon)|}{4\varepsilon^2} + \sum_{l \in N(\mathbf{p},\varepsilon)}\frac{1}{(\varepsilon+\Delta_l)^2}\bigg)\log\frac{1}{\theta} \qquad (64)$$
$$= \frac{2\gamma_0(p^*+\varepsilon)(1-p^*-\varepsilon)}{7(\xi+1)}\bigg(\frac{|M_0(\mathbf{p},\varepsilon)|}{4\varepsilon^2} + \sum_{l \in N(\mathbf{p},\varepsilon)}\frac{1}{(\varepsilon+\Delta_l)^2}\bigg)\log\frac{1+4\delta_0}{4\delta}, \qquad (65)$$
where (65) uses the definition of $\theta$ in (14). Finally, we obtain (12) by combining (65) with (19).

4. Conclusion

We have presented a refined analysis of best-arm identification following the gap-based approach of [14], but incorporating refinements that circumvent some weaknesses, leading to a bound matching the divergence-based approach [15] in many cases. It would be of interest to determine whether further refinements could allow this approach to match [15] in all cases, or the extent to which the gap-based approach extends beyond Bernoulli rewards and/or beyond the standard best-arm identification problem (e.g., to ranking problems [21]).

Author Contributions

L.V.T. conceptualized the problem, and established and wrote the main results and proofs. J.S. provided ongoing supervision and corrections, and assistance with the writing. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Singapore National Research Foundation (NRF) under grant number R-252-000-A74-281.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Proof of Lemma 1 (Constant-Probability Event for Small Enough $\mathbb{E}_1[G_{1,l}]$)

As in [14], the proof of this lemma is based on Kolmogorov's maximal inequality [22].
Lemma A1.
(Kolmogorov's Theorem [22]) Let $Y_1, Y_2, \ldots, Y_n : \Omega \to \mathbb{R}$ be independent random variables defined on a common probability space $(\Omega, \mathcal{F}, P)$, with expectation $\mathbb{E}[Y_k] = 0$ and variance $\mathrm{Var}[Y_k] < \infty$ for $k = 1, 2, \ldots, n$. Then, for each $\lambda > 0$,
$$P\Big(\max_{1 \le k \le n}|S_k| \ge \lambda\Big) \le \frac{1}{\lambda^2}\mathrm{Var}[S_n] = \frac{1}{\lambda^2}\sum_{k=1}^n \mathbb{E}[Y_k^2], \qquad (A1)$$
where $S_k = Y_1 + Y_2 + \cdots + Y_k$.
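A quick Monte Carlo check of (A1) for centered Bernoulli increments (an illustration of ours, not part of the proof):

```python
import numpy as np

rng = np.random.default_rng(0)
n, lam, trials, p = 200, 10.0, 20000, 0.3

# Empirical P(max_k |S_k| >= lam) for S_k a partial sum of centered Bernoulli(p)'s.
hits = 0
for _ in range(trials):
    y = (rng.random(n) < p).astype(float) - p   # E[Y_k] = 0, Var[Y_k] = p(1-p)
    if np.abs(np.cumsum(y)).max() >= lam:
        hits += 1

bound = n * p * (1 - p) / lam**2                # (1/lambda^2) * Var[S_n]
print(hits / trials, "<=", bound)               # empirical frequency vs. the bound
```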
We start by simplifying the main assumption of the lemma:
$$\frac{\gamma_0}{\nu_l}\log\frac{1}{\theta} > \mathbb{E}_1[G_{1,l}] \qquad (A2)$$
$$\ge \frac{1}{\nu_l}\log\frac{1}{\theta}\cdot P_1\Big[G_{1,l} > \frac{1}{\nu_l}\log\frac{1}{\theta}\Big] \qquad (A3)$$
$$= \frac{1}{\nu_l}\log\frac{1}{\theta}\cdot P_1[A_l^c], \qquad (A4)$$
where (A3) follows from Markov's inequality, and (A4) follows from the definition of $A_l$ in (32).
It follows from (A4) that
$$P_1[A_l] > 1-\gamma_0. \qquad (A5)$$
We define
$$V = \Big\{l \in [M] : \frac{\varepsilon+\Delta_l}{p_l} \le \frac12\Big\}, \qquad (A6)$$
and we will find it convenient to treat the cases $l \in V$ and $l \notin V$ separately. For $l \notin V$, from (30) and (33), we have $A_l \cap C_l = A_l$, since $\theta \in (0,1)$, $\xi > 0$, and $G_{2,l} = 0$; it immediately follows from (A5) that
$$P_1[A_l \cap C_l] > 1-\gamma_0. \qquad (A7)$$
On the other hand, for $l \in V$, we can simplify the definition of $\tilde{\alpha}_l$ in (27) as follows:
$$\tilde{\alpha}_l = \alpha_l - \frac43\big(\alpha_l(1-p_l)\big)^2 \qquad (A8)$$
$$= \alpha_l - \frac43\,\alpha_l(1-p_l)\cdot\alpha_l(1-p_l) \qquad (A9)$$
$$= \alpha_l - \frac43\,\alpha_l(1-p_l)\cdot\frac{\varepsilon+\Delta_l}{p_l} \qquad (A10)$$
$$\ge \alpha_l - \frac23\,\alpha_l(1-p_l) \qquad (A11)$$
$$\ge \alpha_l - \frac23\,\alpha_l \qquad (A12)$$
$$= \frac13\,\alpha_l \qquad (A13)$$
$$> 0, \qquad (A14)$$
where (A10) follows from (25), and (A11) follows from the definition of the set $V$ in (A6). It follows that
$$0 < \tilde{\alpha}_l \le \alpha_l \le \beta_l \qquad (A15)$$
for all $l \in V$, where the second inequality in (A15) follows from $p_l \le p^* < p^*+\varepsilon$ and the definitions of $\alpha_l$ and $\beta_l$ in (25) and (26), respectively.
Similarly, for $l \in V$, we can simplify $\tilde{\beta}_l$ from (28) as follows:
$$\tilde{\beta}_l = \beta_l - \frac43\big(\alpha_l(1-p_l)\big)^2 \qquad (A16)$$
$$\ge \alpha_l - \frac43\big(\alpha_l(1-p_l)\big)^2 \qquad (A17)$$
$$= \tilde{\alpha}_l \qquad (A18)$$
$$> 0, \qquad (A19)$$
where (A17) follows from (A15), (A18) follows from (27), and (A19) again uses (A15). It follows that
$$0 < \tilde{\beta}_l \le \beta_l \qquad (A20)$$
for all $l \in V$.
Now, let
$$Z_l(j) := \beta_l\big(X_l(j) - p_l\big), \quad j = 1, 2, \ldots \qquad (A21)$$
Then, we have
$$\mathbb{E}_1[Z_l(j)] = \mathbb{E}_1\big[\beta_l(X_l(j) - p_l)\big] = \beta_l\,\mathbb{E}_1\big[X_l(j) - p_l\big] = 0. \qquad (A22)$$
In addition, we note that $Z_l(1), Z_l(2), \ldots$ form an i.i.d. sequence by the i.i.d. property of $X_l(1), X_l(2), \ldots$.
For each positive integer $t_l$, let $K_{l,t_l} := \sum_{j=1}^{t_l} X_l(j)$, and define
$$U_l := \beta_l\big(K_l - p_lT_l\big), \qquad (A23)$$
$$V_l(t_l) := \beta_l\big(K_{l,t_l} - p_lt_l\big) \qquad (A24)$$
$$= \sum_{j=1}^{t_l}\beta_l\big(X_l(j) - p_l\big) \qquad (A25)$$
$$= \sum_{j=1}^{t_l} Z_l(j). \qquad (A26)$$
Observe that
$$\sum_{j=1}^{t_l}\mathbb{E}_1\big[Z_l(j)^2\big] = \sum_{j=1}^{t_l}\beta_l^2\,\mathbb{E}_1\big[(X_l(j) - p_l)^2\big] \qquad (A27)$$
$$= \sum_{j=1}^{t_l}\beta_l^2\,p_l(1-p_l) \qquad (A28)$$
$$= t_l\,\beta_l^2\,p_l(1-p_l), \qquad (A29)$$
where (A28) follows since a Bernoulli($\rho$) random variable has variance $\rho(1-\rho)$.
We are now ready to upper bound $P_1[C_l^c \cap A_l]$ for $l \in V$:
$$P_1[C_l^c \cap A_l] = P_1\bigg[\Big\{\tilde{\beta}_l(p_lT_l-K_l)\mathbb{1}\{p_lT_l > K_l\} + \tilde{\alpha}_l(p_lT_l-K_l)\mathbb{1}\{p_lT_l \le K_l\} > \frac{\xi}{\nu_l}\cdot\frac{1-p_l}{1-p^*-\varepsilon}\log\frac{1}{\theta}\Big\} \cap A_l\bigg] \qquad (A30)$$
$$\le P_1\bigg[\Big\{\tilde{\beta}_l|p_lT_l-K_l|\mathbb{1}\{p_lT_l > K_l\} + \tilde{\alpha}_l|p_lT_l-K_l|\mathbb{1}\{p_lT_l \le K_l\} > \frac{\xi}{\nu_l}\cdot\frac{1-p_l}{1-p^*-\varepsilon}\log\frac{1}{\theta}\Big\} \cap A_l\bigg] \qquad (A31)$$
$$\le P_1\bigg[\Big\{\beta_l|p_lT_l-K_l| > \frac{\xi}{\nu_l}\cdot\frac{1-p_l}{1-p^*-\varepsilon}\log\frac{1}{\theta}\Big\} \cap A_l\bigg] \qquad (A32)$$
$$= P_1\bigg[\Big\{|U_l| > \frac{\xi}{\nu_l}\cdot\frac{1-p_l}{1-p^*-\varepsilon}\log\frac{1}{\theta}\Big\} \cap \Big\{T_l \le \frac{2(p^*+\varepsilon)(1-p_l)\big(1-(p^*+\varepsilon)\big)}{7\nu_l\,\alpha_l^2p_l^2(1-p_l)^2}\log\frac{1}{\theta}\Big\}\bigg] \qquad (A33)$$
$$\le P_1\bigg[\max_{t_l \le \frac{2(p^*+\varepsilon)(1-p_l)(1-(p^*+\varepsilon))\log\frac{1}{\theta}}{7\nu_l\alpha_l^2p_l^2(1-p_l)^2}} |V_l(t_l)| > \frac{\xi}{\nu_l}\cdot\frac{1-p_l}{1-p^*-\varepsilon}\log\frac{1}{\theta}\bigg] \qquad (A34)$$
$$= P_1\bigg[\max_{t_l \le \frac{2(p^*+\varepsilon)(1-p_l)(1-(p^*+\varepsilon))\log\frac{1}{\theta}}{7\nu_l\alpha_l^2p_l^2(1-p_l)^2}} \Big|\sum_{j=1}^{t_l} Z_l(j)\Big| > \frac{\xi}{\nu_l}\cdot\frac{1-p_l}{1-p^*-\varepsilon}\log\frac{1}{\theta}\bigg], \qquad (A35)$$
where:
  • (A30) uses the definitions of $C_l$ and $G_{2,l}$;
  • (A32) follows from (A15) and (A20), along with $\mathbb{1}\{p_lT_l > K_l\} + \mathbb{1}\{p_lT_l \le K_l\} = 1$;
  • (A33) uses the definitions of $U_l$ and $A_l$;
  • (A34) follows from the definitions of $U_l$ and $V_l(t_l)$ in (A23) and (A24) (which imply $U_l = V_l(T_l)$);
  • (A35) follows from (A26).
Defining $n_l = \frac{2(p^*+\varepsilon)(1-p_l)\left(1-(p^*+\varepsilon)\right)\log\frac{1}{\theta}}{7\nu_l\,\alpha_l^2p_l^2(1-p_l)^2}$ for brevity, we continue from (A35) as follows:
$$P_1[C_l^c \cap A_l] \le \max_{t_l \le n_l} \frac{\sum_{j=1}^{t_l}\mathbb{E}_1\big[Z_l(j)^2\big]}{\Big(\frac{\xi}{\nu_l}\cdot\frac{1-p_l}{1-p^*-\varepsilon}\log\frac{1}{\theta}\Big)^2} \qquad (A36)$$
$$= \max_{t_l \le n_l} \frac{p_l(1-p_l)\,t_l\,\beta_l^2}{\Big(\frac{\xi}{\nu_l}\cdot\frac{1-p_l}{1-p^*-\varepsilon}\log\frac{1}{\theta}\Big)^2} \qquad (A37)$$
$$\le \frac{2\nu_l(p^*+\varepsilon)(1-p_l)\big(1-(p^*+\varepsilon)\big)}{7\xi^2\Big(\frac{1-p_l}{1-p^*-\varepsilon}\Big)^2 p_l(1-p_l)\log\frac{1}{\theta}}\Big(\frac{\beta_l}{\alpha_l}\Big)^2 \qquad (A38)$$
$$= \frac{2\nu_l(p^*+\varepsilon)\big(1-(p^*+\varepsilon)\big)}{7\xi^2\Big(\frac{1-p_l}{1-p^*-\varepsilon}\Big)^2 p_l\log\frac{1}{\theta}}\Big(\frac{\beta_l}{\alpha_l}\Big)^2 \qquad (A39)$$
$$\le \frac{3\nu_l\big(1-(p^*+\varepsilon)\big)}{7\xi^2\Big(\frac{1-p_l}{1-p^*-\varepsilon}\Big)^2\log\frac{1}{\theta}}\Big(\frac{\beta_l}{\alpha_l}\Big)^2 \qquad (A40)$$
$$= \frac{3\nu_l\big(1-(p^*+\varepsilon)\big)}{7\xi^2\log\frac{1}{\theta}} \qquad (A41)$$
$$\le \frac{3(\xi+1)}{7\xi^2\log\frac{1}{\theta}} \qquad (A42)$$
$$= \gamma_0, \qquad (A43)$$
where:
  • (A36) follows from Lemma A1 with $n = n_l$ and $k = t_l$;
  • (A37) follows from (A29);
  • (A38) follows from the definition of $n_l$;
  • (A40) follows since the condition $\frac{\varepsilon+\Delta_l}{p_l} \le \frac12$ defining $V$ can be written as $\frac{\varepsilon+p^*-p_l}{p_l} \le \frac12$, which implies
$$p_l \ge \frac{2}{3}(p^*+\varepsilon) \qquad (A44)$$
for all $l \in V$, so that $\frac{2(p^*+\varepsilon)}{p_l} \le 3$;
  • (A41) follows from the definitions of $\alpha_l$ and $\beta_l$ in (25)–(26), which give $\frac{\beta_l}{\alpha_l} = \frac{1-p_l}{1-(p^*+\varepsilon)}$;
  • (A42) follows from the definition of $\nu_l$ in (31), along with $1-p_l \le 1$;
  • (A43) follows from the definition of $\xi$ in (15).
Combining (A5) and (A43), it follows that
$$P_1[C_l \cap A_l] = P_1[A_l] - P_1[C_l^c \cap A_l] \qquad (A45)$$
$$> 1-2\gamma_0 \qquad (A46)$$
for all $l \in V$, and from (A7) and (A46), we obtain
$$P_1[C_l \cap A_l] > 1-2\gamma_0 \qquad (A47)$$
for all $l \in T(\mathbf{p},\varepsilon)$.
Finally, recall the definition of $S_l$ in (34). From (24) and (A47), and using the union bound, we have
$$P_1[S_l] > (1-2\gamma_0) + \frac12 - 1 \qquad (A48)$$
$$= \frac{1-4\gamma_0}{2} \qquad (A49)$$
for all $l \in T(\mathbf{p},\varepsilon)$, as desired.

Appendix B. Proof of Proposition 1 (Bounding a Likelihood Ratio)

We first state the following lemma, which can easily be verified graphically or proven using basic calculus.
Lemma A2.
For any $x \in [0,1)$, the following holds:
$$1-x \ge \exp\Big(-\frac{x}{1-x}\Big). \qquad (A50)$$
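Lemma A2 is also easy to spot-check numerically (an illustrative check of ours):

```python
import math

# Spot-check Lemma A2: 1 - x >= exp(-x / (1 - x)) for x in [0, 1).
for i in range(1000):
    x = i / 1000.0
    assert 1 - x >= math.exp(-x / (1 - x)) - 1e-15
```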
To prove Proposition 1, we consider two cases:
  • Case 1: $\frac{\varepsilon+\Delta_l}{p_l} > \frac12$. In this case, recalling that $\Delta_l = p^* - p_l$, we have
$$\frac{\varepsilon+p^*}{p_l} = \frac{\varepsilon+\Delta_l}{p_l} + 1 > \frac32 > 1. \qquad (A51)$$
On the other hand, since $\varepsilon + p_l \le \varepsilon + p^* < 1$, we have
$$0 < \frac{\varepsilon+\Delta_l}{1-p_l} = \frac{\varepsilon+p^*-p_l}{1-p_l} \qquad (A52)$$
$$= 1 - \frac{1-(p^*+\varepsilon)}{1-p_l} < 1, \qquad (A53)$$
and applying Lemma A2 gives
$$\frac{1-\varepsilon-p^*}{1-p_l} = 1 - \frac{\varepsilon+\Delta_l}{1-p_l} \qquad (A54)$$
$$\ge \exp\bigg(-\frac{1-p_l}{1-(p^*+\varepsilon)}\cdot\frac{\varepsilon+\Delta_l}{1-p_l}\bigg) \qquad (A55)$$
$$\ge \exp\bigg(-\frac{\varepsilon+\Delta_l}{(1-p_l)\big(1-(p^*+\varepsilon)\big)}\bigg). \qquad (A56)$$
Moreover, by the definition of $\alpha_l$ in (25), we have
$$\alpha_l = \frac{\varepsilon+\Delta_l}{(1-p_l)\,p_l} > \frac{1}{2(1-p_l)}, \qquad (A57)$$
since $\frac{\varepsilon+\Delta_l}{p_l} > \frac12$. It follows from (A57) that
$$\alpha_l < 2\alpha_l^2(1-p_l). \qquad (A58)$$
In addition, again using $\frac{\varepsilon+\Delta_l}{p_l} > \frac12$, we have
$$p_l < 2(\varepsilon+\Delta_l), \qquad (A59)$$
and hence
$$p^*+\varepsilon = (\varepsilon+\Delta_l) + p_l \qquad (A60)$$
$$< 3(\varepsilon+\Delta_l). \qquad (A61)$$
We can now lower bound the likelihood ratio $L_l(W)$ as follows:
$$L_l(W) = \Big(\frac{\varepsilon+p^*}{p_l}\Big)^{K_l}\Big(\frac{1-\varepsilon-p^*}{1-p_l}\Big)^{T_l-K_l} \qquad (A62)$$
$$\ge \exp\bigg(-\frac{\varepsilon+\Delta_l}{(1-p_l)\big(1-(p^*+\varepsilon)\big)}(T_l-K_l)\bigg) \qquad (A63)$$
$$\ge \exp\bigg(-\frac{\varepsilon+\Delta_l}{(1-p_l)\big(1-(p^*+\varepsilon)\big)}\,T_l\bigg) \qquad (A64)$$
$$= \exp\bigg(-\frac{(p^*+\varepsilon)(\varepsilon+\Delta_l)}{(p^*+\varepsilon)(1-p_l)\big(1-(p^*+\varepsilon)\big)}\,T_l\bigg) \qquad (A65)$$
$$\ge \exp\bigg(-\frac{3(\varepsilon+\Delta_l)^2}{(p^*+\varepsilon)(1-p_l)\big(1-(p^*+\varepsilon)\big)}\,T_l\bigg) \qquad (A66)$$
$$= \exp\bigg(-\frac{3\,\alpha_l^2p_l^2(1-p_l)^2}{(p^*+\varepsilon)(1-p_l)\big(1-(p^*+\varepsilon)\big)}\,T_l\bigg), \qquad (A67)$$
where (A63) follows from (A51) and (A56), (A66) follows from (A61), and (A67) follows by the definition of $\alpha_l$ in (25). Hence, (41) holds for this case in which $G_{2,l} = 0$ (also using $3 \le \frac72$).
  • Case 2: $0 \le \frac{\varepsilon+\Delta_l}{p_l} \le \frac12$. For this case, we have
$$L_l(W) = \Big(\frac{\varepsilon+p^*}{p_l}\Big)^{K_l}\Big(\frac{1-\varepsilon-p^*}{1-p_l}\Big)^{T_l-K_l} \qquad (A68)$$
$$= \Big(1+\frac{\varepsilon+\Delta_l}{p_l}\Big)^{K_l}\Big(1-\frac{\varepsilon+\Delta_l}{1-p_l}\Big)^{T_l-K_l} \qquad (A69)$$
$$= \frac{\Big(1-\big(\frac{\varepsilon+\Delta_l}{p_l}\big)^2\Big)^{K_l}}{\Big(1-\frac{\varepsilon+\Delta_l}{p_l}\Big)^{K_l}}\Big(1-\frac{\varepsilon+\Delta_l}{1-p_l}\Big)^{T_l-K_l} \qquad (A70)$$
$$= \frac{\Big(1-\big(\frac{\varepsilon+\Delta_l}{p_l}\big)^2\Big)^{K_l}}{\Big(1-\frac{\varepsilon+\Delta_l}{p_l}\Big)^{K_l}}\Big(1-\frac{\varepsilon+\Delta_l}{1-p_l}\Big)^{K_l(1-p_l)/p_l}\Big(1-\frac{\varepsilon+\Delta_l}{1-p_l}\Big)^{(p_lT_l-K_l)/p_l}, \qquad (A71)$$
where (A69) follows from $\Delta_l = p^*-p_l$ along with (A53), and (A70) follows since $1-a^2 = (1-a)(1+a)$.
From (A53), we have
$$0 < \Big(\frac{\varepsilon+\Delta_l}{1-p_l}\Big)^2 = \Big(1-\frac{1-(p^*+\varepsilon)}{1-p_l}\Big)^2 \le 1-\frac{1-(p^*+\varepsilon)}{1-p_l} < 1. \qquad (A72)$$
Hence, by Lemma A2, we have
$$1-\Big(\frac{\varepsilon+\Delta_l}{1-p_l}\Big)^2 \ge \exp\Bigg(-\frac{1}{1-\big(\frac{\varepsilon+\Delta_l}{1-p_l}\big)^2}\Big(\frac{\varepsilon+\Delta_l}{1-p_l}\Big)^2\Bigg) \qquad (A73)$$
$$\ge \exp\bigg(-\frac{1-p_l}{1-p^*-\varepsilon}\Big(\frac{\varepsilon+\Delta_l}{1-p_l}\Big)^2\bigg) \qquad (A74)$$
$$\ge \exp\bigg(-\frac{1-p_l}{(1-p_l)(1-p^*-\varepsilon)}\Big(\frac{\varepsilon+\Delta_l}{1-p_l}\Big)^2\bigg), \qquad (A75)$$
where (A74) follows from (A72), and (A75) uses $1-p_l \le 1$.
For the third term in (A71), we proceed as follows:
$$\Big(1-\frac{\varepsilon+\Delta_l}{1-p_l}\Big)^{K_l(1-p_l)/p_l} = \frac{\Big(1-\big(\frac{\varepsilon+\Delta_l}{1-p_l}\big)^2\Big)^{K_l(1-p_l)/p_l}}{\Big(1+\frac{\varepsilon+\Delta_l}{1-p_l}\Big)^{K_l(1-p_l)/p_l}} \qquad (A76)$$
$$\ge \frac{\exp\bigg(-\frac{1-p_l}{(1-p_l)\left(1-(p^*+\varepsilon)\right)}\Big(\frac{\varepsilon+\Delta_l}{1-p_l}\Big)^2\frac{K_l(1-p_l)}{p_l}\bigg)}{\Big(1+\frac{\varepsilon+\Delta_l}{1-p_l}\Big)^{K_l(1-p_l)/p_l}} \qquad (A77)$$
$$= \frac{\exp\bigg(-\frac{(1-p_l)^2}{p_l(1-p_l)\left(1-(p^*+\varepsilon)\right)}\Big(\frac{\varepsilon+\Delta_l}{1-p_l}\Big)^2 K_l\bigg)}{\Big(1+\frac{\varepsilon+\Delta_l}{1-p_l}\Big)^{K_l(1-p_l)/p_l}} \qquad (A78)$$
$$\ge \frac{\exp\bigg(-\frac{3(1-p_l)^2}{2(p^*+\varepsilon)(1-p_l)\left(1-(p^*+\varepsilon)\right)}\Big(\frac{\varepsilon+\Delta_l}{1-p_l}\Big)^2 K_l\bigg)}{\Big(1+\frac{\varepsilon+\Delta_l}{1-p_l}\Big)^{K_l(1-p_l)/p_l}} \qquad (A79)$$
$$= \frac{\exp\bigg(-\frac{3\,\alpha_l^2p_l^2(1-p_l)^2}{2(p^*+\varepsilon)(1-p_l)\left(1-(p^*+\varepsilon)\right)} K_l\bigg)}{\Big(1+\frac{\varepsilon+\Delta_l}{1-p_l}\Big)^{K_l(1-p_l)/p_l}} \qquad (A80)$$
$$\ge \frac{\exp\bigg(-\frac{3\,\alpha_l^2p_l^2(1-p_l)^2}{2(p^*+\varepsilon)(1-p_l)\left(1-(p^*+\varepsilon)\right)} T_l\bigg)}{\Big(1+\frac{\varepsilon+\Delta_l}{1-p_l}\Big)^{K_l(1-p_l)/p_l}} \qquad (A81)$$
$$\ge \exp\bigg(-\frac{3\,\alpha_l^2p_l^2(1-p_l)^2}{2(p^*+\varepsilon)(1-p_l)\left(1-(p^*+\varepsilon)\right)} T_l\bigg)\exp\bigg(-\frac{\varepsilon+\Delta_l}{1-p_l}\cdot\frac{K_l(1-p_l)}{p_l}\bigg) \qquad (A82)$$
$$= \exp\bigg(-\frac{3\,\alpha_l^2p_l^2(1-p_l)^2}{2(p^*+\varepsilon)(1-p_l)\left(1-(p^*+\varepsilon)\right)} T_l\bigg)\exp\bigg(-\frac{\varepsilon+\Delta_l}{p_l}K_l\bigg), \qquad (A83)$$
where (A76) uses $1-a^2 = (1-a)(1+a)$, (A77) follows from (A75), (A79) follows from (A44), (A80) follows by the definition of $\alpha_l$ in (25), (A81) follows from the fact that $K_l \le T_l$, and (A82) follows from the fact that $(1+x)^y \le \exp(xy)$ for all $x \ge 0$ and $y \ge 0$.
On the other hand, observe that
$$\Big(1-\frac{\varepsilon+\Delta_l}{p_l}\Big)^{K_l} \le \exp\Big(-\frac{\varepsilon+\Delta_l}{p_l}K_l\Big), \qquad (A84)$$
since $(1-x)^y \le \exp(-xy)$ for all $0 \le x \le 1$ and $y \ge 0$. It follows from (A83) and (A84) that
$$\frac{\Big(1-\frac{\varepsilon+\Delta_l}{1-p_l}\Big)^{K_l(1-p_l)/p_l}}{\Big(1-\frac{\varepsilon+\Delta_l}{p_l}\Big)^{K_l}} \ge \exp\bigg(-\frac{3\,\alpha_l^2p_l^2(1-p_l)^2}{2(p^*+\varepsilon)(1-p_l)\big(1-(p^*+\varepsilon)\big)}\,T_l\bigg), \qquad (A85)$$
and it follows from (A71) and (A85) that
$$L_l(W) \ge \bigg(1-\Big(\frac{\varepsilon+\Delta_l}{p_l}\Big)^2\bigg)^{K_l}\Big(1-\frac{\varepsilon+\Delta_l}{1-p_l}\Big)^{(p_lT_l-K_l)/p_l} \times \exp\bigg(-\frac{3\,\alpha_l^2p_l^2(1-p_l)^2}{2(p^*+\varepsilon)(1-p_l)\big(1-(p^*+\varepsilon)\big)}\,T_l\bigg). \qquad (A86)$$
Now, since $0 < \big(\frac{\varepsilon+\Delta_l}{p_l}\big)^2 \le \frac14 = 1-\frac34$ (as we are in the case $0 \le \frac{\varepsilon+\Delta_l}{p_l} \le \frac12$), Lemma A2 gives
$$\bigg(1-\Big(\frac{\varepsilon+\Delta_l}{p_l}\Big)^2\bigg)^{K_l} \ge \exp\bigg(-\frac43\Big(\frac{\varepsilon+\Delta_l}{p_l}\Big)^2\bigg)^{K_l} \qquad (A87)$$
$$= \exp\bigg(-\frac43\Big(\frac{\varepsilon+\Delta_l}{p_l}\Big)^2 K_l\bigg) \qquad (A88)$$
$$= \exp\Big(-\frac43\big(\alpha_l(1-p_l)\big)^2 K_l\Big) \qquad (A89)$$
$$= \exp\Big(\frac43\big(\alpha_l(1-p_l)\big)^2(p_lT_l-K_l)\Big)\exp\Big(-\frac43\big(\alpha_l(1-p_l)\big)^2 p_lT_l\Big), \qquad (A90)$$
where (A89) follows from the definition of $\alpha_l$ in (25).
We now consider two further sub-cases:
(i) If $p_lT_l > K_l$, then we have
$$\Big(1-\frac{\varepsilon+\Delta_l}{1-p_l}\Big)^{(p_lT_l-K_l)/p_l} \ge \exp\bigg(-\frac{1-p_l}{1-(p^*+\varepsilon)}\cdot\frac{\varepsilon+\Delta_l}{1-p_l}\cdot\frac{p_lT_l-K_l}{p_l}\bigg) \qquad (A91)$$
$$= \exp\bigg(-\frac{1-p_l}{1-(p^*+\varepsilon)}\,\alpha_l\,\big(p_lT_l-K_l\big)\bigg) \qquad (A92)$$
$$= \exp\Big(-\beta_l\big(p_lT_l-K_l\big)\Big), \qquad (A93)$$
where (A91) follows from Lemma A2 along with (A53), and (A93) follows from the definition of $\beta_l$ in (26).
(ii) If $p_lT_l \le K_l$, then we have
$$\Big(1-\frac{\varepsilon+\Delta_l}{1-p_l}\Big)^{(p_lT_l-K_l)/p_l} \ge \exp\bigg(-\frac{\varepsilon+\Delta_l}{1-p_l}\cdot\frac{p_lT_l-K_l}{p_l}\bigg) \qquad (A94)$$
$$= \exp\Big(-\alpha_l\big(p_lT_l-K_l\big)\Big), \qquad (A95)$$
where (A94) follows from the fact that $(1-x)^y \ge \exp(-xy)$ if $0 \le x \le 1$ and $y \le 0$.
From (A93) and (A95), we obtain
$$\Big(1-\frac{\varepsilon+\Delta_l}{1-p_l}\Big)^{(p_lT_l-K_l)/p_l} \ge \exp\Big(-\beta_l(p_lT_l-K_l)\mathbb{1}\{p_lT_l > K_l\} - \alpha_l(p_lT_l-K_l)\mathbb{1}\{p_lT_l \le K_l\}\Big). \qquad (A96)$$
Now, from (A86), (A90), and (A96), we have
$$L_l(W) \ge \exp\Big(\frac43\big(\alpha_l(1-p_l)\big)^2(p_lT_l-K_l)\Big)\exp\Big(-\frac43\alpha_l^2p_l(1-p_l)^2T_l\Big) \times \exp\bigg(-\frac{3\,\alpha_l^2p_l^2(1-p_l)^2}{2(p^*+\varepsilon)(1-p_l)\big(1-(p^*+\varepsilon)\big)}T_l\bigg) \times \exp\Big(-\beta_l(p_lT_l-K_l)\mathbb{1}\{p_lT_l > K_l\} - \alpha_l(p_lT_l-K_l)\mathbb{1}\{p_lT_l \le K_l\}\Big) \qquad (A97)$$
$$= \exp\Big(-\frac43\alpha_l^2p_l(1-p_l)^2T_l\Big)\exp\bigg(-\frac{3\,\alpha_l^2p_l^2(1-p_l)^2}{2(p^*+\varepsilon)(1-p_l)\big(1-(p^*+\varepsilon)\big)}T_l\bigg) \times \exp\Big(-\tilde{\beta}_l(p_lT_l-K_l)\mathbb{1}\{p_lT_l > K_l\} - \tilde{\alpha}_l(p_lT_l-K_l)\mathbb{1}\{p_lT_l \le K_l\}\Big) \qquad (A98)$$
$$\ge \exp\Big(-\frac{2}{p^*+\varepsilon}\alpha_l^2p_l^2(1-p_l)^2T_l\Big)\exp\bigg(-\frac{3\,\alpha_l^2p_l^2(1-p_l)^2}{2(p^*+\varepsilon)(1-p_l)\big(1-(p^*+\varepsilon)\big)}T_l\bigg) \times \exp\Big(-\tilde{\beta}_l(p_lT_l-K_l)\mathbb{1}\{p_lT_l > K_l\} - \tilde{\alpha}_l(p_lT_l-K_l)\mathbb{1}\{p_lT_l \le K_l\}\Big) \qquad (A99)$$
$$\ge \exp\bigg(-\frac{7\,\alpha_l^2p_l^2(1-p_l)^2}{2(p^*+\varepsilon)(1-p_l)\big(1-(p^*+\varepsilon)\big)}T_l\bigg) \times \exp\Big(-\tilde{\beta}_l(p_lT_l-K_l)\mathbb{1}\{p_lT_l > K_l\} - \tilde{\alpha}_l(p_lT_l-K_l)\mathbb{1}\{p_lT_l \le K_l\}\Big), \qquad (A100)$$
where (A98) follows from the definitions of $\tilde{\alpha}_l$ and $\tilde{\beta}_l$ in (27)–(28), (A99) follows by writing $\frac43\alpha_l^2p_l(1-p_l)^2 = \frac{4}{3p_l}\,\alpha_l^2p_l^2(1-p_l)^2$ and applying (A44), and (A100) uses $(1-p_l)\big(1-(p^*+\varepsilon)\big) \le 1$. Hence, (41) also holds in this case, and the proof is complete.

Appendix C. Differences in Analysis Techniques

Here we briefly overview some of the main differences in our analysis techniques compared to [14], leading to the improvements highlighted in Section 2:
  • We remove the restriction $p_l \ge \frac{\varepsilon+p^*}{1+\frac12}$ (equivalently, $\frac{\varepsilon+\Delta_l}{p_l} \le \frac12$) used in the subsets $M(\mathbf{p},\varepsilon)$ and $N(\mathbf{p},\varepsilon)$ in [14] (Equations (4) and (5)), so that our lower bound depends on all of the arms. To achieve this, our analysis frequently needs to handle the cases $\frac{\varepsilon+\Delta_l}{p_l} > \frac12$ and $\frac{\varepsilon+\Delta_l}{p_l} \le \frac12$ separately (e.g., see the proof of Proposition 1).
  • The preceding separation into two cases also introduces further difficulties. For example, our definition of $G_{2,l}$ in (30) is modified to contain different constants for the cases $p_lT_l > K_l$ and $p_lT_l \le K_l$, which is not the case in [14] (Lemma 2). Accordingly, the quantities $\tilde{\alpha}_l$ in (27) and $\tilde{\beta}_l$ in (28) appear in our proof but not in [14].
  • We replace the inequality $(1-x)^y \ge e^{-1.78xy}$ (for $x \in [0,\frac12]$ and $y \ge 0$) of [14] (Lemma 3) by Lemma A2. By using this stronger inequality, we can improve the constant term $c_1$ from $O(\underline{p}^2)$ to $(p^*+\varepsilon)^2$. In addition, Lemma A2 does not require the assumption $x \le \frac12$ as in [14] (Lemma 3), so we can use it for the case $p^* > \frac12$, which required a separate analysis in [14].
  • To further reduce the constant term from $(p^*+\varepsilon)^2$ to $(p^*+\varepsilon)$ (see Theorem 1), we also need to use other mathematical tricks to sharpen certain inequalities, such as (A83).

References

  1. Lattimore, T.; Szepesvári, C. Bandit Algorithms; Cambridge University Press: Cambridge, UK, to appear.
  2. Villar, S.S.; Bowden, J.; Wason, J. Multi-armed bandit models for the optimal design of clinical trials: Benefits and challenges. Stat. Sci. 2015, 30, 199–215. [Google Scholar] [CrossRef] [PubMed]
  3. Li, L.; Chu, W.; Langford, J.; Schapire, R.E. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, Raleigh, NC, USA, 26–30 April 2010. [Google Scholar]
  4. Awerbuch, B.; Kleinberg, R.D. Adaptive routing with end-to-end feedback: Distributed learning and geometric approaches. In Proceedings of the Symposium of Theory of Computing (STOC04), Chicago, IL, USA, 5–8 June 2004. [Google Scholar]
  5. Shen, W.; Wang, J.; Jiang, Y.G.; Zha, H. Portfolio Choices with Orthogonal Bandit Learning. In Proceedings of the 24th International Joint Conference on Artificial Intelligence (IJCAI-15), Buenos Aires, Argentina, 25–31 July 2015. [Google Scholar]
  6. Bechhofer, R.E. A sequential multiple-decision procedure for selecting the best one of several normal populations with a common unknown variance, and its use with various experimental designs. Biometrics 1958, 14, 408–429. [Google Scholar] [CrossRef]
  7. Paulson, E. A sequential procedure for selecting the population with the largest mean from k normal populations. Ann. Math. Stat. 1964, 35, 174–180. [Google Scholar] [CrossRef]
  8. Even-Dar, E.; Mannor, S.; Mansour, Y. PAC bounds for multi-armed bandit and Markov decision processes. In Proceedings of the Fifteenth Annual Conference on Computational Learning Theory, Sydney, Australia, 8–10 July 2002. [Google Scholar]
  9. Kalyanakrishnan, S.; Tewari, A.; Auer, P.; Stone, P. PAC subset selection in stochastic multi-armed bandits. In Proceedings of the International Conference on Machine Learning, Edinburgh, UK, 26 June–1 July 2012. [Google Scholar]
  10. Gabillon, V.; Ghavamzadeh, M.; Lazaric, A. Best arm identification: A unified approach to fixed budget and fixed confidence. In Proceedings of the 26th Annual Conference on Neural Information Processing Systems, Lake Tahoe, NV, USA, 3–6 December 2012. [Google Scholar]
  11. Jamieson, K.; Malloy, M.; Nowak, R.; Bubeck, S. On finding the largest mean among many. arXiv 2013, arXiv:1306.3917. [Google Scholar]
  12. Karnin, Z.; Koren, T.; Somekh, O. Almost optimal exploration in multi-armed bandits. In Proceedings of the 30th International Conference on Machine Learning (ICML 2013), Atlanta, GA, USA, 16–21 June 2013. [Google Scholar]
  13. Jamieson, K.; Malloy, M.; Nowak, R.; Bubeck, S. lil’UCB: An Optimal Exploration Algorithm for Multi-Armed Bandits. arXiv 2013, arXiv:1312.7308. [Google Scholar]
  14. Mannor, S.; Tsitsiklis, J.N. The Sample Complexity of Exploration in the Multi-Armed Bandit Problem. J. Mach. Learn. Res. 2004, 5, 623–648. [Google Scholar]
  15. Kaufmann, E.; Cappé, O.; Garivier, A. On the Complexity of Best-arm Identification in Multi-armed Bandit Models. J. Mach. Learn. Res. 2016, 17, 1–42. [Google Scholar]
  16. Carpentier, A.; Locatelli, A. Tight (Lower) Bounds for the Fixed Budget Best Arm Identification Bandit Problem. In Proceedings of the Conference On Learning Theory, New York, NY, USA, 23–26 June 2016. [Google Scholar]
  17. Chen, L.; Li, J.; Qiao, M. Nearly Instance Optimal Sample Complexity Bounds for Top-k Arm Selection. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS 2017), Fort Lauderdale, FL, USA, 20–22 April 2017. [Google Scholar]
  18. Simchowitz, M.; Jamieson, K.G.; Recht, B. The Simulator: Understanding Adaptive Sampling in the Moderate-Confidence Regime. arXiv 2017, arXiv:1702.05186. [Google Scholar]
  19. Bubeck, S.; Cesa-Bianchi, N. Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems. In Foundations and Trends in Machine Learning; Now Publishers Inc.: Hanover, MA, USA, 2012; Volume 5. [Google Scholar]
  20. Royden, H.; Fitzpatrick, P. Real Analysis, 4th ed.; Pearson: New York, NY, USA, 2010. [Google Scholar]
  21. Katariya, S.; Jain, L.; Sengupta, N.; Evans, J.; Nowak, R. Adaptive Sampling for Coarse Ranking. In Proceedings of the 21st International Conference on Artificial Intelligence and Statistics (AISTATS 2018), Lanzarote, Spain, 9–11 April 2018. [Google Scholar]
  22. Billingsley, P. Probability and Measure, 3rd ed.; Wiley-Interscience: Hoboken, NJ, USA, 1995. [Google Scholar]
