We denote the true (actual) value of action $a$ as $q_*(a)$, and the estimated value at the $t$-th time step as $Q_t(a)$.

Action-Value methods

These approaches repeatedly select actions and estimate their values from the observed rewards.

value estimate

Sample average

We perform exploration and, for every distinct action $a$ that was chosen $K_a$ times with individual rewards $r_1, r_2, \ldots, r_{K_a}$, the average reward is estimated as

$$Q_t(a) = \frac{r_1 + r_2 + \cdots + r_{K_a}}{K_a}$$

Due to the law of large numbers, $Q_t(a)$ converges to $q_*(a)$.
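A minimal sketch of this convergence, assuming a hypothetical arm whose rewards are Gaussian noise around a made-up true value `true_value`:

```python
import random

random.seed(0)

true_value = 1.5   # hypothetical true action value q*(a)
n_samples = 100_000

# Sample-average estimate: the mean of many noisy reward draws.
total = 0.0
for _ in range(n_samples):
    reward = random.gauss(true_value, 1.0)  # noisy reward signal
    total += reward

estimate = total / n_samples
print(abs(estimate - true_value))  # small for large n, by the law of large numbers
```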

To decrease memory requirements, we can store only the current estimate $Q_k$ and the count $k$. The incremental average can be derived iteratively as

$$Q_{k+1} = Q_k + \frac{1}{k}\left[r_k - Q_k\right]$$
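The incremental rule above can be sketched as follows; the function name is ours, and the check that it reproduces the plain average uses a made-up reward sequence:

```python
def incremental_update(q, k, reward):
    """One step of the incremental sample average:
    Q_{k+1} = Q_k + (1/k) * (r_k - Q_k)."""
    return q + (reward - q) / k

# Processing rewards one at a time gives the same result
# as storing them all and averaging at the end.
rewards = [2.0, 0.0, 4.0, 2.0]
q = 0.0
for k, r in enumerate(rewards, start=1):
    q = incremental_update(q, k, r)
print(q)  # 2.0, same as sum(rewards) / len(rewards)
```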

Weighted sample average

A variant of this incremental update, where recent rewards have a larger impact than old ones, uses a constant step-size parameter $\alpha \in (0, 1]$. This changes the formula to

$$Q_{k+1} = Q_k + \alpha\left[r_k - Q_k\right]$$

  • The weight of each past reward decays exponentially for a constant $\alpha$, so it is sometimes called an exponential, recency-weighted average.
  • It is suitable for non-stationary problems, where the environment (the bandit) changes over time.
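A sketch of the constant step-size update tracking a non-stationary target; the reward stream and $\alpha = 0.1$ are illustrative choices:

```python
def constant_step_update(q, reward, alpha=0.1):
    """Exponential recency-weighted average:
    Q_{k+1} = Q_k + alpha * (r_k - Q_k)."""
    return q + alpha * (reward - q)

# After the reward level shifts, the estimate keeps tracking it,
# because old rewards are discounted with weight (1 - alpha)^k.
q = 0.0
for r in [1.0] * 50 + [5.0] * 50:   # non-stationary reward stream
    q = constant_step_update(q, r, alpha=0.1)
print(round(q, 3))  # close to 5.0 after the shift
```

A plain sample average over the same stream would settle near 3.0, the overall mean, which is why the constant step size is preferred when the bandit changes.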

The step size $\alpha_k = \frac{1}{k}$ results in the basic sample average, which does not decay old rewards and is more suitable for stationary bandits. The Robbins-Monro conditions can be used to determine whether a sequence with step sizes $\alpha_k$ converges to the true value; they require

$$\sum_{k=1}^{\infty} \alpha_k = \infty \quad \text{and} \quad \sum_{k=1}^{\infty} \alpha_k^2 < \infty$$

Non-convergence is not necessarily bad; it simply means that the action-value estimate never stops changing. That is desirable for non-stationary problems.
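A purely numerical illustration of the Robbins-Monro conditions for $\alpha_k = 1/k$ (the truncation point `n` is arbitrary): the partial sums of $\alpha_k$ keep growing, while the partial sums of $\alpha_k^2$ level off near $\pi^2/6$. A constant $\alpha$ violates the second condition, which matches its non-convergence on stationary problems.

```python
# Robbins-Monro check for alpha_k = 1/k, truncated at n terms:
# the first sum diverges (~ ln n), the second converges to pi^2 / 6.
n = 1_000_000
sum_alpha = sum(1 / k for k in range(1, n + 1))
sum_alpha_sq = sum(1 / k**2 for k in range(1, n + 1))
print(sum_alpha)     # grows without bound as n increases
print(sum_alpha_sq)  # approaches pi^2 / 6 ~= 1.645
```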

action selecting

The greedy approach always selects the action with the greatest estimated value. The action selected at step $t$ is given by

$$A_t = \arg\max_a Q_t(a)$$

This approach always exploits.
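The argmax selection can be sketched in a few lines; the function name and the example value estimates are ours:

```python
def greedy_action(q_values):
    """Select the index of the action with the greatest estimated value."""
    return max(range(len(q_values)), key=lambda a: q_values[a])

print(greedy_action([0.2, 1.7, 0.9]))  # 1
```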

The $\varepsilon$-greedy approach has probability $\varepsilon$ to explore instead of exploit. As $t \to \infty$, each action is sampled infinitely often, so convergence of $Q_t(a)$ to $q_*(a)$ is guaranteed when the step-size sequence satisfies the conditions above. The probability of selecting the greedy action is at least $1 - \varepsilon$.
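A minimal sketch of $\varepsilon$-greedy selection; the function name, the example value estimates, and the trial count are illustrative:

```python
import random

def epsilon_greedy_action(q_values, epsilon=0.1, rng=random):
    """With probability epsilon pick a uniformly random action (explore),
    otherwise pick the greedy action (exploit)."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

# The greedy action is chosen with probability at least 1 - epsilon
# (plus epsilon / len(q_values) when exploration lands on it by chance).
random.seed(0)
counts = [0, 0, 0]
for _ in range(10_000):
    counts[epsilon_greedy_action([0.2, 1.7, 0.9], epsilon=0.1)] += 1
print(counts)  # the greedy action (index 1) dominates
```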