On Gap-Based Lower Bounding Techniques for Best-Arm Identification
Abstract
1. Introduction
2. Overview of Results
2.1. Problem Setup
- There are $M$ arms with Bernoulli rewards; the means are $\mu_1, \ldots, \mu_M$, and this set of means is said to define the bandit instance $\boldsymbol{\mu} = (\mu_1, \ldots, \mu_M)$. Our analysis will consider instances with arms sorted such that $\mu_1 \ge \mu_2 \ge \cdots \ge \mu_M$, without loss of generality.
- The agent would like to find an arm whose mean is within $\epsilon$ of the highest arm mean for some $\epsilon > 0$, i.e., an arm $i$ such that $\mu_i \ge \mu_1 - \epsilon$; such an arm is said to be $\epsilon$-optimal. Even if there are multiple such arms, identifying just one of them suffices.
- In each round, the agent can pull any arm $l$ and observe a reward $X_{l,s} \sim \mathrm{Bernoulli}(\mu_l)$, where $s$ is the number of times the $l$-th arm has been pulled so far. We assume that the rewards are independent, both across arms and across time.
- In each round, the agent can alternatively choose to terminate and output an arm index believed to be $\epsilon$-optimal. The time index at which this occurs is denoted by $T$, and is a random variable because it is allowed to depend on the rewards observed. We are interested in the expected number of arm pulls $\mathbb{E}[T]$ (also called the sample complexity) for a given instance $\boldsymbol{\mu}$, which should ideally be as low as possible.
- An algorithm is said to be $(\epsilon, \delta)$-PAC (Probably Approximately Correct) if, for all bandit instances, it outputs an $\epsilon$-optimal arm with probability at least $1 - \delta$ when it terminates at the stopping time $T$ (a minimal simulation sketch of this setup follows the list).
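To make the setup concrete, the following is a minimal simulation sketch, not taken from the paper: the names `pull` and `naive_pac` and the example instance are illustrative, and the strategy shown is the standard uniform-sampling baseline whose correctness follows from Hoeffding's inequality (the paper's subject is lower bounds, not this algorithm).

```python
import math
import random

def pull(mu, l):
    """Pull arm l of instance mu and observe a Bernoulli(mu[l]) reward."""
    return 1 if random.random() < mu[l] else 0

def naive_pac(mu, eps, delta):
    """Uniform-sampling (eps, delta)-PAC baseline.

    Each arm is pulled n times, with n chosen via Hoeffding's inequality so
    that each empirical mean lies within eps/2 of its true mean with
    probability at least 1 - delta/M; on that event, the empirically best
    arm is eps-optimal. Returns (output arm index, total number of pulls T).
    """
    M = len(mu)
    n = math.ceil((2.0 / eps**2) * math.log(2.0 * M / delta))
    means = [sum(pull(mu, l) for _ in range(n)) / n for l in range(M)]
    best = max(range(M), key=lambda l: means[l])
    return best, n * M  # for this non-adaptive strategy, T = n * M is deterministic

# Example instance: arm 0 is the unique best arm.
mu = [0.9, 0.8, 0.5, 0.5]
arm, T = naive_pac(mu, eps=0.1, delta=0.05)
print(f"output arm {arm} after T = {T} pulls")
```

This baseline uses $O\big((M/\epsilon^2)\log(M/\delta)\big)$ pulls on every instance; instance-dependent lower bounds of the kind studied here characterize how much better an adaptive algorithm can do on a given instance $\boldsymbol{\mu}$.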
2.2. Existing Lower Bounds
2.3. Our Result and Discussion
3. Proof of Theorem 1
4. Conclusion
Author Contributions
Funding
Conflicts of Interest
Appendix A. Proof of Lemma 1 (Constant-Probability Event for Small Enough […])
- (A30) uses the definitions of […] and […];
- (A33) uses the definitions of […] and […];
- (A34) follows from the definitions of […] and […] in (A23) and (A24) (which imply […]);
- (A35) follows from (A26);
- (A36) follows from Lemma A1 with […] and […];
- (A37) follows from (A29);
- (A38) follows from the definition of […];
- (A40) follows since the condition in […] yields […], which implies […];
- (A41) follows from the definitions of […] and […] in (25)–(26);
- (A42) follows from the definition of […] in (31);
- (A43) follows from the definition of […] in (15).
Appendix B. Proof of Proposition 1 (Bounding a Likelihood Ratio)
- Case 1: […]. In this case, recalling that […], we have […]. On the other hand, since […], we have […]. In addition, again using […], we have […].
- Case 2: […]. For this case, we have […]. From (A53), we have […]. For the third term in (A71), we proceed as follows: […]. On the other hand, observe that […]. Now, since […] (recall that we are in the case […]), by Lemma A2, we have […]. We now consider two further sub-cases:
- (i) […];
- (ii) […].
- If […], then we have […].
From (A93) and (A95), we obtain […].
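Since the symbolic expressions in the case analysis above were lost in extraction, the following generic identity may help orient the reader. It is the standard change-of-measure starting point for likelihood-ratio bounds of this kind (see, e.g., [15]), not necessarily the exact quantity manipulated in (A53)–(A95): for two Bernoulli instances $\boldsymbol{\mu}$ and $\boldsymbol{\mu}'$,

\[
\log \frac{\mathrm{d}\mathbb{P}_{\boldsymbol{\mu}}}{\mathrm{d}\mathbb{P}_{\boldsymbol{\mu}'}} = \sum_{l=1}^{M} \sum_{s=1}^{N_l(T)} \left( X_{l,s} \log \frac{\mu_l}{\mu'_l} + (1 - X_{l,s}) \log \frac{1 - \mu_l}{1 - \mu'_l} \right),
\]

where $N_l(T)$ is the number of pulls of arm $l$ up to the stopping time $T$. Taking expectations under $\boldsymbol{\mu}$ turns the inner sum for arm $l$ into $\mathbb{E}[N_l(T)]\, d(\mu_l, \mu'_l)$, with $d(\cdot, \cdot)$ the binary KL divergence; this is how bounds on the likelihood ratio translate into lower bounds on the expected number of pulls.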
Appendix C. Differences in Analysis Techniques
- We remove the restriction […] (or […]) used in the subsets […] and […] in Equations (4) and (5) of [14], so that our lower bound depends on all of the arms. To achieve this, our analysis frequently needs to handle the cases […] and […] separately (e.g., see the proof of Proposition 1).
- The preceding separation into two cases also introduces further difficulties. For example, our definition of […] in (30) is modified to contain different constants for the cases […] and […], which is not the case in Lemma 2 of [14]. Accordingly, the quantities […] in (27) and […] in (28) appear in our proof but not in [14].
- To further reduce the constant term from […] to […] (see Theorem 1), we also need to use other mathematical tricks to sharpen certain inequalities, such as (A83).
References
- Lattimore, T.; Szepesvári, C. Bandit Algorithms; Cambridge University Press: Cambridge, UK, 2020.
- Villar, S.S.; Bowden, J.; Wason, J. Multi-armed bandit models for the optimal design of clinical trials: Benefits and challenges. Stat. Sci. 2015, 30, 199–215.
- Li, L.; Chu, W.; Langford, J.; Schapire, R.E. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, Raleigh, NC, USA, 26–30 April 2010.
- Awerbuch, B.; Kleinberg, R.D. Adaptive routing with end-to-end feedback: Distributed learning and geometric approaches. In Proceedings of the Symposium on Theory of Computing (STOC 2004), Chicago, IL, USA, 5–8 June 2004.
- Shen, W.; Wang, J.; Jiang, Y.G.; Zha, H. Portfolio Choices with Orthogonal Bandit Learning. In Proceedings of the 24th International Joint Conference on Artificial Intelligence (IJCAI-15), Buenos Aires, Argentina, 25–31 July 2015.
- Bechhofer, R.E. A sequential multiple-decision procedure for selecting the best one of several normal populations with a common unknown variance, and its use with various experimental designs. Biometrics 1958, 14, 408–429.
- Paulson, E. A sequential procedure for selecting the population with the largest mean from k normal populations. Ann. Math. Stat. 1964, 35, 174–180.
- Even-Dar, E.; Mannor, S.; Mansour, Y. PAC bounds for multi-armed bandit and Markov decision processes. In Proceedings of the Fifteenth Annual Conference on Computational Learning Theory, Sydney, Australia, 8–10 July 2002.
- Kalyanakrishnan, S.; Tewari, A.; Auer, P.; Stone, P. PAC subset selection in stochastic multi-armed bandits. In Proceedings of the International Conference on Machine Learning, Edinburgh, UK, 26 June–1 July 2012.
- Gabillon, V.; Ghavamzadeh, M.; Lazaric, A. Best arm identification: A unified approach to fixed budget and fixed confidence. In Proceedings of the 26th Annual Conference on Neural Information Processing Systems, Lake Tahoe, NV, USA, 3–6 December 2012.
- Jamieson, K.; Malloy, M.; Nowak, R.; Bubeck, S. On finding the largest mean among many. arXiv 2013, arXiv:1306.3917.
- Karnin, Z.; Koren, T.; Somekh, O. Almost optimal exploration in multi-armed bandits. In Proceedings of the 30th International Conference on Machine Learning (ICML 2013), Atlanta, GA, USA, 16–21 June 2013.
- Jamieson, K.; Malloy, M.; Nowak, R.; Bubeck, S. lil’UCB: An Optimal Exploration Algorithm for Multi-Armed Bandits. arXiv 2013, arXiv:1312.7308.
- Mannor, S.; Tsitsiklis, J.N. The Sample Complexity of Exploration in the Multi-Armed Bandit Problem. J. Mach. Learn. Res. 2004, 5, 623–648.
- Kaufmann, E.; Cappé, O.; Garivier, A. On the Complexity of Best-arm Identification in Multi-armed Bandit Models. J. Mach. Learn. Res. 2016, 17, 1–42.
- Carpentier, A.; Locatelli, A. Tight (Lower) Bounds for the Fixed Budget Best Arm Identification Bandit Problem. In Proceedings of the Conference On Learning Theory, New York, NY, USA, 23–26 June 2016.
- Chen, L.; Li, J.; Qiao, M. Nearly Instance Optimal Sample Complexity Bounds for Top-k Arm Selection. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS 2017), Fort Lauderdale, FL, USA, 20–22 April 2017.
- Simchowitz, M.; Jamieson, K.G.; Recht, B. The Simulator: Understanding Adaptive Sampling in the Moderate-Confidence Regime. arXiv 2017, arXiv:1702.05186.
- Bubeck, S.; Cesa-Bianchi, N. Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems. In Foundations and Trends in Machine Learning; Now Publishers Inc.: Hanover, MA, USA, 2012; Volume 5.
- Royden, H.; Fitzpatrick, P. Real Analysis, 4th ed.; Pearson: New York, NY, USA, 2010.
- Katariya, S.; Jain, L.; Sengupta, N.; Evans, J.; Nowak, R. Adaptive Sampling for Coarse Ranking. In Proceedings of the 21st International Conference on Artificial Intelligence and Statistics (AISTATS 2018), Lanzarote, Spain, 9–11 April 2018.
- Billingsley, P. Probability and Measure, 3rd ed.; Wiley-Interscience: Hoboken, NJ, USA, 1995.
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).