[n-armed bandits]

Inspired by a slot machine with multiple levers, this problem models a situation where we must repeatedly choose among several actions and want to maximize the total reward accumulated over many tries. It is a classic problem in reinforcement learning.

Description

Let there be n different options, or actions. You repeatedly choose from these actions; after each choice you receive a numerical reward drawn from a stationary probability distribution that depends on the action you selected. The objective is to maximize the expected total reward over some number of discrete time steps.

Each action has an unknown expected reward (its value). If these values were known, the problem would be trivial; the difficulty comes from having to estimate them from noisy samples.
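
The setup above can be sketched as a tiny simulator. This is a minimal sketch, not from the text: the class name, the Gaussian reward model, and the arm means are illustrative assumptions (the text only requires some stationary distribution per action).

```python
import random

class Bandit:
    """A stationary n-armed bandit: each arm's reward is drawn from a
    fixed distribution whose mean is unknown to the agent."""

    def __init__(self, means):
        self.means = means  # true expected reward of each arm (hidden from the agent)

    def pull(self, arm):
        # Reward = true mean + unit-variance Gaussian noise (an assumed model)
        return random.gauss(self.means[arm], 1.0)

bandit = Bandit([0.2, 0.8, 0.5])  # illustrative arm means
reward = bandit.pull(1)           # one noisy sample from arm 1
```

Averaging many pulls of one arm converges to that arm's true value, which is exactly what the estimation methods below exploit.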

Solving methods

  • Methods that estimate the value of each action and use those estimates to select actions are called Action-Value Methods.
  • Methods that maintain a numerical preference for each action and select actions probabilistically (e.g. via a softmax over the preferences) are called Gradient bandits.
  • Contextual bandits bridge bandits and the full reinforcement learning problem: the chosen action depends on the current situation, but still affects only the immediate reward. Such tasks are also called Associative Search.
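
The gradient bandit idea from the list above can be sketched as follows: each arm gets a numerical preference, actions are sampled from a softmax over the preferences, and preferences move toward actions whose reward beats a running baseline. The Gaussian reward model, step size `alpha`, and arm means are illustrative assumptions, not from the text.

```python
import math
import random

def softmax(h):
    # Numerically stable softmax over preference values
    m = max(h)
    exps = [math.exp(x - m) for x in h]
    s = sum(exps)
    return [e / s for e in exps]

def gradient_bandit(true_means, steps=3000, alpha=0.1):
    h = [0.0] * len(true_means)  # preferences, not value estimates
    baseline = 0.0               # running average reward
    for t in range(1, steps + 1):
        pi = softmax(h)
        a = random.choices(range(len(h)), weights=pi)[0]
        r = random.gauss(true_means[a], 1.0)   # assumed reward model
        baseline += (r - baseline) / t
        # Raise the preference of the taken action if its reward beat the
        # baseline; lower the others proportionally to their probability.
        for i in range(len(h)):
            indicator = 1.0 if i == a else 0.0
            h[i] += alpha * (r - baseline) * (indicator - pi[i])
    return softmax(h)
```

After enough steps the softmax concentrates most of its probability on the best arm.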

All these methods are centered around the exploration-exploitation trade-off: exploiting the action that currently looks best maximizes immediate reward, while exploring other actions improves the value estimates and may pay off later. We cannot sample every action enough to know its value exactly, yet purely greedy algorithms often lock onto a suboptimal action.
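
The simplest resolution of this trade-off is the ε-greedy action-value method: with probability epsilon pick a random arm (explore), otherwise pick the arm with the highest sample-average estimate (exploit). The Gaussian reward model and the parameter choices below are illustrative assumptions.

```python
import random

def epsilon_greedy(true_means, steps=3000, epsilon=0.1):
    n_arms = len(true_means)
    q = [0.0] * n_arms       # incremental sample-average value estimates
    counts = [0] * n_arms
    for _ in range(steps):
        if random.random() < epsilon:
            a = random.randrange(n_arms)                 # explore
        else:
            a = max(range(n_arms), key=lambda i: q[i])   # exploit
        r = random.gauss(true_means[a], 1.0)             # assumed reward model
        counts[a] += 1
        q[a] += (r - q[a]) / counts[a]                   # running average update
    return q
```

With epsilon = 0 this degenerates into the purely greedy algorithm and can get stuck on whichever arm happened to pay well first; a small positive epsilon keeps every estimate improving.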

Links

https://cxl.com/blog/bandit-tests/