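The probability of selecting action $a$ at time $t$ is a softmax over the action preferences $H_t$:

$$\Pr(A_t = a) = \frac{e^{H_t(a)}}{\sum_{b=1}^{n} e^{H_t(b)}} = \pi_t(a)$$

These probability distributions are learned with an algorithm based on stochastic gradient ascent. After action $A_t$ is selected and reward $R_t$ is received, the preferences are updated as

$$H_{t+1}(A_t) = H_t(A_t) + \alpha \, (R_t - \bar{R}_t)\,(1 - \pi_t(A_t))$$

$$H_{t+1}(a) = H_t(a) - \alpha \, (R_t - \bar{R}_t)\,\pi_t(a) \quad \text{for all } a \neq A_t$$

where $\alpha > 0$ is the step size and $\bar{R}_t$ is a baseline, typically the average of the rewards received so far. When the reward exceeds the baseline, the preference for the chosen action rises and the preferences for all other actions fall; when the reward is below the baseline, the opposite happens.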
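The softmax and the two update rules can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names (`softmax`, `update_preferences`, `run_bandit`), the Gaussian reward model with unit variance, and the choice to use an incremental average that includes the current reward as the baseline $\bar{R}_t$ are all assumptions made for the example.

```python
import math
import random


def softmax(preferences):
    """Return pi_t(a) = exp(H_t(a)) / sum_b exp(H_t(b)) for every action a."""
    # Subtracting the max leaves the ratios unchanged but avoids overflow.
    m = max(preferences)
    exps = [math.exp(h - m) for h in preferences]
    total = sum(exps)
    return [e / total for e in exps]


def update_preferences(preferences, action, reward, baseline, alpha):
    """Apply one preference update and return the new list H_{t+1}."""
    pi = softmax(preferences)
    advantage = reward - baseline  # R_t - R_bar_t
    new_prefs = []
    for a, h in enumerate(preferences):
        if a == action:
            # Chosen action: H + alpha * (R - R_bar) * (1 - pi(A_t))
            new_prefs.append(h + alpha * advantage * (1.0 - pi[a]))
        else:
            # Every other action: H - alpha * (R - R_bar) * pi(a)
            new_prefs.append(h - alpha * advantage * pi[a])
    return new_prefs


def run_bandit(true_means, steps=2000, alpha=0.1, seed=0):
    """Hypothetical test bed: Gaussian rewards with the given means, unit variance."""
    rng = random.Random(seed)
    prefs = [0.0] * len(true_means)
    baseline = 0.0
    for t in range(1, steps + 1):
        pi = softmax(prefs)
        action = rng.choices(range(len(prefs)), weights=pi)[0]
        reward = rng.gauss(true_means[action], 1.0)
        # Assumed baseline: running average of rewards, including this one.
        baseline += (reward - baseline) / t
        prefs = update_preferences(prefs, action, reward, baseline, alpha)
    return softmax(prefs)


if __name__ == "__main__":
    print(run_bandit([0.0, 2.0]))
```

Because the chosen action's change is proportional to $1 - \pi_t(A_t)$ and the other actions' changes to $-\pi_t(a)$, the changes sum to zero on every step, so the total of the preferences stays constant and only their relative sizes matter.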