We denote the true (actual) value of action $a$ as $q_*(a)$, and the estimated value at the $t$-th time step as $Q_t(a)$.

Action-Value methods

These approaches repeatedly select actions and estimate their values from the observed rewards.

value estimate

Sample average

We perform exploration and, for every distinct action $a$ that was chosen $K_a$ times with individual rewards $r_1, r_2, \ldots, r_{K_a}$, the average reward is estimated as

$$Q_t(a) = \frac{r_1 + r_2 + \cdots + r_{K_a}}{K_a}$$

Due to the law of large numbers, $Q_t(a)$ converges to $q_*(a)$.
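A minimal sketch of this convergence, assuming a hypothetical arm whose rewards are Gaussian noise around a made-up true value `true_value`:

```python
import random

random.seed(0)

true_value = 1.5   # hypothetical true action value q*(a)
n_samples = 100_000

# Sample-average estimate: the mean of many noisy reward draws.
total = 0.0
for _ in range(n_samples):
    reward = random.gauss(true_value, 1.0)  # noisy reward signal
    total += reward

estimate = total / n_samples
print(abs(estimate - true_value))  # small for large n, by the law of large numbers
```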

To decrease memory requirements, we can store only the current estimate $Q_k$ and the count $k$. The incremental average can be derived iteratively as

$$Q_{k+1} = Q_k + \frac{1}{k}\left[r_k - Q_k\right]$$
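The incremental rule above can be sketched as follows; the function name is ours, and the check that it reproduces the plain average uses a made-up reward sequence:

```python
def incremental_update(q, k, reward):
    """One step of the incremental sample average:
    Q_{k+1} = Q_k + (1/k) * (r_k - Q_k)."""
    return q + (reward - q) / k

# Processing rewards one at a time gives the same result
# as storing them all and averaging at the end.
rewards = [2.0, 0.0, 4.0, 2.0]
q = 0.0
for k, r in enumerate(rewards, start=1):
    q = incremental_update(q, k, r)
print(q)  # 2.0, same as sum(rewards) / len(rewards)
```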

Weighted sample average

A variant of this incremental update, where recent rewards have a larger impact than old ones, uses a constant step-size parameter $\alpha \in (0, 1]$. This changes the formula to

$$Q_{k+1} = Q_k + \alpha\left[r_k - Q_k\right]$$

  • The weight of each past reward decays exponentially for a constant $\alpha$, so it is sometimes called an exponential, recency-weighted average.
  • It is suitable for non-stationary problems, where the environment (the bandit) changes over time.
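A sketch of the constant step-size update tracking a non-stationary target; the reward stream and $\alpha = 0.1$ are illustrative choices:

```python
def constant_step_update(q, reward, alpha=0.1):
    """Exponential recency-weighted average:
    Q_{k+1} = Q_k + alpha * (r_k - Q_k)."""
    return q + alpha * (reward - q)

# After the reward level shifts, the estimate keeps tracking it,
# because old rewards are discounted with weight (1 - alpha)^k.
q = 0.0
for r in [1.0] * 50 + [5.0] * 50:   # non-stationary reward stream
    q = constant_step_update(q, r, alpha=0.1)
print(round(q, 3))  # close to 5.0 after the shift
```

A plain sample average over the same stream would settle near 3.0, the overall mean, which is why the constant step size is preferred when the bandit changes.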

The step size $\alpha_k = \frac{1}{k}$ results in the basic sample average, which does not decay old rewards and is more suitable for stationary bandits. The Robbins-Monro conditions can be used to determine whether a sequence with step sizes $\alpha_k$ converges to the true value; they require

$$\sum_{k=1}^{\infty} \alpha_k = \infty \quad \text{and} \quad \sum_{k=1}^{\infty} \alpha_k^2 < \infty$$

Non-convergence is not necessarily bad; it simply means that the action-value estimate never stops changing. That is desirable for non-stationary problems.
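A purely numerical illustration of the Robbins-Monro conditions for $\alpha_k = 1/k$ (the truncation point `n` is arbitrary): the partial sums of $\alpha_k$ keep growing, while the partial sums of $\alpha_k^2$ level off near $\pi^2/6$. A constant $\alpha$ violates the second condition, which matches its non-convergence on stationary problems.

```python
# Robbins-Monro check for alpha_k = 1/k, truncated at n terms:
# the first sum diverges (~ ln n), the second converges to pi^2 / 6.
n = 1_000_000
sum_alpha = sum(1 / k for k in range(1, n + 1))
sum_alpha_sq = sum(1 / k**2 for k in range(1, n + 1))
print(sum_alpha)     # grows without bound as n increases
print(sum_alpha_sq)  # approaches pi^2 / 6 ~= 1.645
```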

action selecting

The greedy approach always selects the action with the greatest estimated value. The action selected at step $t$ is given by

$$A_t = \arg\max_a Q_t(a)$$

This approach always exploits.
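The argmax selection can be sketched in a few lines; the function name and the example value estimates are ours:

```python
def greedy_action(q_values):
    """Select the index of the action with the greatest estimated value."""
    return max(range(len(q_values)), key=lambda a: q_values[a])

print(greedy_action([0.2, 1.7, 0.9]))  # 1
```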

The $\varepsilon$-greedy approach has probability $\varepsilon$ to explore instead of exploit. As $t \to \infty$, each action is sampled infinitely often, so convergence of $Q_t(a)$ to $q_*(a)$ is guaranteed when the step-size sequence satisfies the conditions above. The probability of selecting the greedy action is at least $1 - \varepsilon$.
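A minimal sketch of $\varepsilon$-greedy selection; the function name, the example value estimates, and the trial count are illustrative:

```python
import random

def epsilon_greedy_action(q_values, epsilon=0.1, rng=random):
    """With probability epsilon pick a uniformly random action (explore),
    otherwise pick the greedy action (exploit)."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

# The greedy action is chosen with probability at least 1 - epsilon
# (plus epsilon / len(q_values) when exploration lands on it by chance).
random.seed(0)
counts = [0, 0, 0]
for _ in range(10_000):
    counts[epsilon_greedy_action([0.2, 1.7, 0.9], epsilon=0.1)] += 1
print(counts)  # the greedy action (index 1) dominates
```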