James Cheshire
Abstract: We consider a stochastic multi armed bandit problem with potentially infinite arms, with the assumption that the arm set can be partitioned into an unknown number of \(K\) distinct categories, where arms in the same category are identically distributed. Let \(\theta_k\) denote the proportion of the \(k\)th category in the arm set and let \(\Delta_k\) denote the minimum gap from the \(k\) category to another, in terms of expected reward. We aim to identify optimal learning rates in the setting where the goal of the learner is to identify a member of \textit{each individual category}. That is, at the end of \(T\) rounds they aim to output a list of \(K\) arms containing a distinct member of each category. This is in contrast to the classical problem of best arm identification where the goal of the learner is return a single arm of the optimal category. The regret of the leaner will be given by their probability of error. We describe an algorithm which we conjecture attains the optimal rate, \(\exp\left(-T\min(\theta_k \Delta_k)\right)\), up to log terms. The proposed algorithm does not require knowledge of the \(\theta_k\)s. Our algorithm does, however, need to know \(\min(\Delta_k)\), or a lower bound on it, which will then appear in the regret bound.
Zoom-link on request