Multi-armed bandits with episode context
A multi-armed bandit episode consists of n trials, each allowing selection of one of K arms and yielding a payoff drawn from a distribution over [0, 1] associated with that arm. We assume contextual side information is available at the start of the episode. This context enables an arm predictor to identify possibly favorable arms, but predictions may be imperfect, so they need to be combined with further exploration during the episode. Our setting is an alternative to classical multi-armed bandits, which provide no contextual side information, and to contextual bandits, which provide new context at each individual trial. Multi-armed bandits with episode context arise naturally, for example in computer Go, where context is used to bias move decisions made by a multi-armed bandit algorithm. The UCB1 algorithm for multi-armed bandits achieves worst-case regret bounded by $O\left(\sqrt{Kn\log(n)}\right)$. We seek to improve on this using episode context, particularly when K is large. Using a predictor that places weight $M_i > 0$ on arm i, with weights summing to 1, we present the PUCB algorithm, which achieves regret $O\left(\frac{1}{M_{\ast}}\sqrt{n\log(n)}\right)$ where $M_{\ast}$ is the weight on the optimal arm. We illustrate the behavior of PUCB with small simulation experiments, present extensions that provide additional capabilities for PUCB, and describe methods for obtaining suitable predictors for use with PUCB.
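To make the setting concrete, the sketch below simulates one episode with a UCB-style selection index whose exploration is biased by predictor weights $M_i$. The function names (`pucb_style_select`, `run_episode`), the exploration constant `c`, and the exact form of the predictor term are illustrative assumptions; this is not the precise PUCB index analyzed in the paper, only a toy demonstration of how a predictor can steer early exploration toward favored arms while observed payoffs take over as trials accumulate.

```python
import math
import random


def pucb_style_select(counts, sums, weights, t, c=2.0):
    """Pick an arm via a UCB-style index biased by predictor weights.

    counts[i]  -- pulls of arm i so far
    sums[i]    -- total payoff observed on arm i (payoffs lie in [0, 1])
    weights[i] -- predictor weight M_i > 0 on arm i, summing to 1
    t          -- current trial number, 1-based
    c          -- exploration constant (an arbitrary illustrative choice)

    Note: this is a sketch, not the exact PUCB index from the paper.
    The predictor term shrinks as an arm accumulates pulls, so arms the
    predictor favors are tried first and observed payoffs dominate later.
    """
    best_arm, best_index = 0, -math.inf
    for i, (n_i, s_i, m_i) in enumerate(zip(counts, sums, weights)):
        mean = s_i / n_i if n_i > 0 else 0.0
        explore = math.sqrt(math.log(t + 1) / n_i) if n_i > 0 else 0.0
        bias = c * m_i * math.sqrt(t) / (1 + n_i)  # predictor-guided exploration
        index = mean + explore + bias
        if index > best_index:
            best_arm, best_index = i, index
    return best_arm


def run_episode(payoff_means, weights, n_trials, seed=0):
    """Play one episode against Bernoulli arms (toy example)."""
    rng = random.Random(seed)
    K = len(payoff_means)
    counts, sums, total = [0] * K, [0.0] * K, 0.0
    for t in range(1, n_trials + 1):
        arm = pucb_style_select(counts, sums, weights, t)
        reward = 1.0 if rng.random() < payoff_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        total += reward
    return total


if __name__ == "__main__":
    # K = 10 arms; the (imperfect) predictor puts most weight on the best arm.
    means = [0.3] * 9 + [0.7]
    weights = [0.05] * 9 + [0.55]
    print("total payoff over 1000 trials:", run_episode(means, weights, 1000))
```

In this toy run, most of the predictor weight sits on the best of the 10 arms, so the selection rule spends few trials on the other nine; this is roughly the kind of behavior that the $\frac{1}{M_{\ast}}$ dependence in the regret bound reflects.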
Annals of Mathematics and Artificial Intelligence – Springer Journals
Published: Aug 26, 2011