Gradient bandits

Pr(A_{t} = a) = \frac{e^{H_{t}(a)}}{\sum_{b = 1}^{n} e^{H_{t}(b)}} = π_{t}(a)
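The softmax above can be sketched in a few lines of Python. This is only an illustration: the function name softmax_policy and the list-based representation of the preferences H_{t}(a) are assumptions, not part of the original note.

```python
import math


def softmax_policy(preferences):
    """Return the probabilities pi_t(a) for a list of preferences H_t(a).

    Hypothetical helper for illustration only.
    """
    # Subtracting the maximum leaves the result unchanged
    # but keeps exp() from overflowing for large preferences.
    m = max(preferences)
    exps = [math.exp(h - m) for h in preferences]
    total = sum(exps)
    return [e / total for e in exps]


# Equal preferences give a uniform distribution over the actions.
uniform = softmax_policy([0.0, 0.0, 0.0])
```

Only differences between preferences matter: adding the same constant to every H_{t}(a) leaves π_{t}(a) unchanged.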

The preferences H_{t}(a), and with them the probabilities π_{t}(a), are learned with an algorithm based on stochastic gradient ascent on the expected reward.

H_{t + 1}(A_{t}) = H_{t}(A_{t}) + α (R_{t} - \overline{R}_{t}) (1 - π_{t}(A_{t})) and
H_{t + 1}(a) = H_{t}(a) - α (R_{t} - \overline{R}_{t}) π_{t}(a) \forall a \neq A_{t}
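A minimal sketch of one update step and a small simulation loop follows. Several details are assumptions made for illustration and do not come from the note: the function names, the Gaussian rewards with unit variance, and the choice to take the baseline \overline{R}_{t} as the running average of the rewards seen before the current step.

```python
import math
import random


def softmax_policy(preferences):
    """Probabilities pi_t(a) from preferences H_t(a) (hypothetical helper)."""
    m = max(preferences)  # shift for numerical stability
    exps = [math.exp(h - m) for h in preferences]
    total = sum(exps)
    return [e / total for e in exps]


def gradient_bandit_step(preferences, baseline, action, reward, alpha):
    """Apply one preference update for the chosen action and all others."""
    pi = softmax_policy(preferences)
    advantage = reward - baseline  # R_t minus the baseline
    updated = []
    for a, (h, p) in enumerate(zip(preferences, pi)):
        if a == action:
            # Chosen action: raised when the reward beats the baseline.
            updated.append(h + alpha * advantage * (1.0 - p))
        else:
            # Every other action moves in the opposite direction.
            updated.append(h - alpha * advantage * p)
    return updated


def run_bandit(true_means, steps=1000, alpha=0.1, seed=0):
    """Simulate a bandit with assumed Gaussian rewards of unit variance."""
    rng = random.Random(seed)
    preferences = [0.0] * len(true_means)
    baseline = 0.0  # assumption: average of the rewards seen so far
    for t in range(1, steps + 1):
        pi = softmax_policy(preferences)
        action = rng.choices(range(len(true_means)), weights=pi)[0]
        reward = rng.gauss(true_means[action], 1.0)
        preferences = gradient_bandit_step(
            preferences, baseline, action, reward, alpha
        )
        baseline += (reward - baseline) / t  # incremental average
    return preferences
```

Because the probabilities sum to one, the individual changes cancel and the sum of all preferences stays constant from step to step; checking that is a quick sanity test for an implementation.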

Vojtěch Tóth