[n-armed bandits]
Inspired by a slot machine with multiple levers, this problem models a situation where we must repeatedly choose among several actions and want to maximize the total reward accumulated over many tries. It is a classic problem in reinforcement learning.
Description
Let there be n different options, or actions. You repeatedly choose from these actions. After each choice you receive a numerical reward drawn from a stationary probability distribution that depends on the action you selected. The objective is to maximize the expected total reward over some period of time, or number of discrete steps.
Each action's expected reward is initially unknown, so it must be estimated from the rewards actually received.
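The setup above can be sketched as a tiny simulator. This is a minimal illustration, not from the text: the choice of 10 arms, Gaussian reward distributions, and the `Bandit` class name are all my assumptions.

```python
import random

class Bandit:
    """A stationary n-armed bandit: each arm pays out a reward drawn from
    a fixed Gaussian distribution whose mean is hidden from the player."""

    def __init__(self, n_arms, seed=0):
        rng = random.Random(seed)
        # True expected reward of each arm (unknown to the player).
        self.means = [rng.gauss(0.0, 1.0) for _ in range(n_arms)]
        self.rng = rng

    def pull(self, arm):
        """Play one arm; the reward depends only on that arm's distribution."""
        return self.rng.gauss(self.means[arm], 1.0)

bandit = Bandit(n_arms=10)
reward = bandit.pull(3)  # one play of arm 3
```

Because the distributions are stationary, repeated pulls of the same arm let the player estimate its expected reward.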
Solution methods
- Methods that estimate action values and use those estimates to select actions are called action-value methods.
- Methods that maintain a numerical preference for each action and select actions probabilistically from those preferences are called gradient bandit methods.
- Contextual bandits bridge bandits and the full reinforcement learning problem: the best action depends on the current situation, but each action still affects only the immediate reward. This setting is also known as associative search.
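As an illustration of the gradient bandit idea from the list above, here is a rough sketch: preferences are turned into action probabilities by a softmax, and the preference of an action is nudged up when its reward beats a running-average baseline. The arm means, step size `alpha`, and step count are illustrative assumptions, not values from the text.

```python
import math
import random

def gradient_bandit(true_means, steps=1000, alpha=0.1, seed=0):
    """Gradient bandit sketch: keep a numerical preference H[a] per action,
    pick actions with softmax probability, and move preferences toward
    actions whose reward beats the running-average baseline."""
    rng = random.Random(seed)
    n = len(true_means)
    H = [0.0] * n          # action preferences
    baseline = 0.0         # running average of all rewards so far
    for t in range(1, steps + 1):
        exp_h = [math.exp(h) for h in H]
        z = sum(exp_h)
        pi = [e / z for e in exp_h]              # softmax action probabilities
        a = rng.choices(range(n), weights=pi)[0]
        r = rng.gauss(true_means[a], 1.0)        # stationary reward
        baseline += (r - baseline) / t
        # Chosen action moves up if r beats the baseline; the others
        # move the opposite way, keeping the preferences balanced.
        for i in range(n):
            indicator = 1.0 if i == a else 0.0
            H[i] += alpha * (r - baseline) * (indicator - pi[i])
    return H

H = gradient_bandit([0.0, 1.0, -1.0])
```

Only the differences between preferences matter: each update sums to zero across actions, so the preferences stay centered.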
All these methods revolve around the exploration-exploitation trade-off: we cannot afford to try every action exhaustively, yet purely greedy algorithms, which always exploit the current best estimate, often lock onto a suboptimal action.
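The trade-off can be made explicit with a single parameter. The ε-greedy sketch below, a simple action-value method, acts greedily on estimated values but explores a random action with probability ε; setting ε = 0 recovers the pure greedy algorithm. The arm means, ε value, and Gaussian rewards are illustrative assumptions.

```python
import random

def epsilon_greedy_bandit(true_means, steps=1000, epsilon=0.1, seed=0):
    """Keep a running estimate Q[a] of each action's value; exploit the
    best estimate, except with probability epsilon pick a random action."""
    rng = random.Random(seed)
    n = len(true_means)
    Q = [0.0] * n        # estimated value of each action
    counts = [0] * n     # how many times each action was taken
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            a = rng.randrange(n)                    # explore
        else:
            a = max(range(n), key=lambda i: Q[i])   # exploit
        r = rng.gauss(true_means[a], 1.0)           # stationary reward
        counts[a] += 1
        # Incremental sample-average update: Q <- Q + (r - Q) / N
        Q[a] += (r - Q[a]) / counts[a]
        total += r
    return Q, total / steps

Q, avg_reward = epsilon_greedy_bandit([0.2, -0.5, 1.0])
```

With ε > 0 every action keeps being sampled occasionally, so the estimates Q converge toward the true expected rewards while most steps still exploit the best-looking arm.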