[n-armed bandits]
Inspired by a slot machine with multiple levers, this problem models a situation where we must repeatedly choose among several actions and want to maximize the total reward accumulated over many tries. It is a classic problem in reinforcement learning.
Description
Let there be n different options, or actions. You repeatedly choose from these actions. After each choice you receive a numerical reward drawn from a stationary probability distribution that depends on the action you selected. The objective is to maximize the expected total reward over some period of time, or number of discrete steps.
Each action's expected reward is initially unknown, so it must be estimated from the rewards actually received.
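The setup above can be sketched as a tiny simulator. This is a minimal illustration, not from the text: the choice of 10 arms, Gaussian reward distributions, and the `Bandit` class name are all my assumptions.

```python
import random

class Bandit:
    """A stationary n-armed bandit: each arm pays out a reward drawn from
    a fixed Gaussian distribution whose mean is hidden from the player."""

    def __init__(self, n_arms, seed=0):
        rng = random.Random(seed)
        # True expected reward of each arm (unknown to the player).
        self.means = [rng.gauss(0.0, 1.0) for _ in range(n_arms)]
        self.rng = rng

    def pull(self, arm):
        """Play one arm; the reward depends only on that arm's distribution."""
        return self.rng.gauss(self.means[arm], 1.0)

bandit = Bandit(n_arms=10)
reward = bandit.pull(3)  # one play of arm 3
```

Because the distributions are stationary, repeated pulls of the same arm let the player estimate its expected reward.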
Solution methods
- Methods that estimate action values and use those estimates to select actions are called action-value methods.
- Methods that maintain a numerical preference for each action and select actions probabilistically from those preferences are called gradient bandit methods.
- Contextual bandits bridge bandits and the full reinforcement learning problem: the best action depends on the current situation, but each action still affects only the immediate reward. This setting is also known as associative search.
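As an illustration of the gradient bandit idea from the list above, here is a rough sketch: preferences are turned into action probabilities by a softmax, and the preference of an action is nudged up when its reward beats a running-average baseline. The arm means, step size `alpha`, and step count are illustrative assumptions, not values from the text.

```python
import math
import random

def gradient_bandit(true_means, steps=1000, alpha=0.1, seed=0):
    """Gradient bandit sketch: keep a numerical preference H[a] per action,
    pick actions with softmax probability, and move preferences toward
    actions whose reward beats the running-average baseline."""
    rng = random.Random(seed)
    n = len(true_means)
    H = [0.0] * n          # action preferences
    baseline = 0.0         # running average of all rewards so far
    for t in range(1, steps + 1):
        exp_h = [math.exp(h) for h in H]
        z = sum(exp_h)
        pi = [e / z for e in exp_h]              # softmax action probabilities
        a = rng.choices(range(n), weights=pi)[0]
        r = rng.gauss(true_means[a], 1.0)        # stationary reward
        baseline += (r - baseline) / t
        # Chosen action moves up if r beats the baseline; the others
        # move the opposite way, keeping the preferences balanced.
        for i in range(n):
            indicator = 1.0 if i == a else 0.0
            H[i] += alpha * (r - baseline) * (indicator - pi[i])
    return H

H = gradient_bandit([0.0, 1.0, -1.0])
```

Only the differences between preferences matter: each update sums to zero across actions, so the preferences stay centered.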
All these methods revolve around the exploration-exploitation trade-off: we cannot afford to try every action exhaustively, yet purely greedy algorithms, which always exploit the current best estimate, often lock onto a suboptimal action.
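The trade-off can be made explicit with a single parameter. The ε-greedy sketch below, a simple action-value method, acts greedily on estimated values but explores a random action with probability ε; setting ε = 0 recovers the pure greedy algorithm. The arm means, ε value, and Gaussian rewards are illustrative assumptions.

```python
import random

def epsilon_greedy_bandit(true_means, steps=1000, epsilon=0.1, seed=0):
    """Keep a running estimate Q[a] of each action's value; exploit the
    best estimate, except with probability epsilon pick a random action."""
    rng = random.Random(seed)
    n = len(true_means)
    Q = [0.0] * n        # estimated value of each action
    counts = [0] * n     # how many times each action was taken
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            a = rng.randrange(n)                    # explore
        else:
            a = max(range(n), key=lambda i: Q[i])   # exploit
        r = rng.gauss(true_means[a], 1.0)           # stationary reward
        counts[a] += 1
        # Incremental sample-average update: Q <- Q + (r - Q) / N
        Q[a] += (r - Q[a]) / counts[a]
        total += r
    return Q, total / steps

Q, avg_reward = epsilon_greedy_bandit([0.2, -0.5, 1.0])
```

With ε > 0 every action keeps being sampled occasionally, so the estimates Q converge toward the true expected rewards while most steps still exploit the best-looking arm.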