Policy Gradient Methods 0: Gradient Bandit Algorithms

A precursor to policy gradient methods: the gradient bandit algorithm, together with the references for this blog series.

A k-armed Bandit Problem

A k-armed bandit is a machine made up of \(k\) one-armed bandits (slot machines).

Slot machine

A slot machine (American English), known variously as a fruit machine (British English), puggy (Scottish English), the slots (Canadian English and American English), poker machine/pokies (Australian English and New Zealand English), fruities (British English) or slots (American English), is a gambling machine that creates a game of chance for its customers. Slot machines are also known pejoratively as one-armed bandits because of the large mechanical levers affixed to the sides of early mechanical machines and the games' ability to empty players' pockets and wallets as thieves would.

The k-armed bandit problem asks how, by repeatedly playing the machine, we can train an agent to figure out which arm is the most profitable and thereby maximize the total payoff.

From the viewpoint of reinforcement learning, the k-armed bandit is the environment; the choice of arm is the action, with the choice at time \(t\) denoted \(A_t\); and the money earned by that choice is the reward, denoted \(R_t\).

From a god's-eye view, every action \(a\) has an expected payoff determined by the reward distribution of the chosen arm, written \(q_{*}(a) \doteq \mathbb{E}\left[R_{t} \mid A_{t}=a\right]\). We do not know this value. One approach is to form, at time \(t\), an estimate \(Q_t(a)\) of the value of taking action \(a\) and make it as close as possible to \(q_*(a)\); another is to define a preference for each action and turn the preferences directly into a probability distribution over actions. This blog series focuses on the latter.

The figure below shows an example with \(k=10\) in which every reward distribution is normal. Since \(q_*(3)\) is the largest, the solution to this instance is to always play machine 3.
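As a concrete illustration, here is a minimal Python sketch of such a testbed. The class name GaussianBandit, the use of numpy, and the way the arm means are drawn are illustrative assumptions rather than part of the problem statement.

```python
import numpy as np

class GaussianBandit:
    """A k-armed bandit testbed with normally distributed rewards (illustrative sketch)."""

    def __init__(self, k=10, reward_std=1.0, seed=None):
        self.rng = np.random.default_rng(seed)
        # True action values q_*(a): one fixed mean per arm, drawn once.
        self.q_star = self.rng.normal(loc=0.0, scale=1.0, size=k)
        self.reward_std = reward_std
        self.k = k

    def pull(self, a):
        """Play arm a and return a reward R_t ~ N(q_*(a), reward_std^2)."""
        return self.rng.normal(self.q_star[a], self.reward_std)

    def best_arm(self):
        """God's-eye view: the arm with the largest expected payoff."""
        return int(np.argmax(self.q_star))
```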

Gradient Bandit Algorithms

We assign each action \(a\) a number called the preference for that action, denoted \(H_t(a)\). The preferences can all be initialized to the same value, e.g. \(0\).

The probability of selecting \(a\) is obtained by putting the preferences through a soft-max distribution, i.e. \[ \operatorname{Pr}\left\{A_{t}=a\right\} \doteq \frac{e^{H_{t}(a)}}{\sum_{b=1}^{k} e^{H_{t}(b)}} \doteq \pi_{t}(a). \] The gradient bandit algorithm then updates \(H_t\) with the following stochastic-gradient-ascent-style rule: \[ \begin{aligned} H_{t+1}\left(A_{t}\right) & \doteq H_{t}\left(A_{t}\right)+\alpha\left(R_{t}-\bar{R}_{t}\right)\left(1-\pi_{t}\left(A_{t}\right)\right), & & \text { and } \\ H_{t+1}(a) & \doteq H_{t}(a)-\alpha\left(R_{t}-\bar{R}_{t}\right) \pi_{t}(a), & & \text { for all } a \neq A_{t}, \end{aligned} \] where \(\alpha>0\) is the step size and \(\bar{R}_t\) is the average of the rewards received so far, which serves as a baseline. Throughout the iteration \(\sum\limits_a H_t(a)\) remains constant, because the total increment is \[ \begin{aligned} \alpha\left(R_t-\bar{R}_t\right)\left(1-\pi_t(A_t)\right)-\sum\limits_{a\neq A_t}\alpha\left(R_t-\bar{R}_t\right)\pi_t(a) =\alpha\left(R_t-\bar{R}_t\right)\Big[1-\pi_t(A_t)-\sum\limits_{a\neq A_t}\pi_t(a)\Big] =0. \end{aligned} \]
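Below is a minimal Python sketch of this update, again assuming numpy; the class name GradientBanditAgent and the incremental running average used for \(\bar{R}_t\) are illustrative choices rather than prescribed details.

```python
import numpy as np

class GradientBanditAgent:
    """Soft-max action preferences updated by the gradient bandit rule (sketch)."""

    def __init__(self, k, alpha=0.1, seed=None):
        self.rng = np.random.default_rng(seed)
        self.H = np.zeros(k)        # preferences H_t(a), initialized to 0
        self.alpha = alpha          # step size
        self.avg_reward = 0.0       # baseline \bar{R}_t, tracked as a running average
        self.t = 0

    def policy(self):
        """pi_t(a): soft-max over the preferences."""
        exp_H = np.exp(self.H - self.H.max())   # subtract max for numerical stability
        return exp_H / exp_H.sum()

    def act(self):
        """Sample A_t ~ pi_t."""
        return self.rng.choice(len(self.H), p=self.policy())

    def update(self, action, reward):
        """Apply the gradient bandit update for the observed (A_t, R_t)."""
        self.t += 1
        self.avg_reward += (reward - self.avg_reward) / self.t
        pi = self.policy()
        # One-hot indicator 1_{a = A_t}; the two update cases collapse into one line.
        one_hot = np.zeros_like(self.H)
        one_hot[action] = 1.0
        self.H += self.alpha * (reward - self.avg_reward) * (one_hot - pi)
```

The two cases of the update collapse into a single vector expression because \(\mathbb{1}_{a=A_t}-\pi_t(a)\) equals \(1-\pi_t(A_t)\) for the chosen action and \(-\pi_t(a)\) for every other action.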

The figure below shows the training curve of this algorithm on a \(k=10\) bandit problem in which the most profitable machine's reward distribution is normal with \(\mu=4\) and \(\sigma=1\).
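A short usage sketch, reusing the hypothetical GaussianBandit and GradientBanditAgent classes from the sketches above; shifting the best arm's mean to 4 mimics the setting described here, while the number of steps and the step size are arbitrary.

```python
import numpy as np

bandit = GaussianBandit(k=10, seed=0)
bandit.q_star[bandit.best_arm()] = 4.0      # make the best machine's rewards ~ N(4, 1)
agent = GradientBanditAgent(k=10, alpha=0.1, seed=0)

rewards = []
for step in range(1000):
    a = agent.act()           # sample A_t ~ pi_t
    r = bandit.pull(a)        # observe R_t
    agent.update(a, r)        # gradient bandit update
    rewards.append(r)

print("probability assigned to the best arm:", agent.policy()[bandit.best_arm()])
print("average reward over the last 100 steps:", np.mean(rewards[-100:]))
```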

Gradient Bandit Algorithms as Stochastic Gradient Ascent

Our maximization objective is \(\mathbb{E}\left[R_{t}\right]=\sum\limits_{x=1}^k \pi_{t}(x) q_{*}(x)\), where \(\pi_t(x)\) is the probability of choosing the \(x\)-th machine and \(q_*\) is unknown.

\(\mathbb{E}\left[R_{t}\right]\) can also be viewed as a function of the preferences \(H_t(a)\), \(a=1,\ldots,k\). Exact gradient ascent would therefore iterate \[ H_{t+1}(a) \doteq H_{t}(a)+\alpha \frac{\partial \mathbb{E}\left[R_{t}\right]}{\partial H_{t}(a)}. \] We now show that the gradient bandit algorithm performs this gradient ascent using only a single sample per update, i.e. stochastic gradient ascent. \[ \begin{aligned} \frac{\partial \mathbb{E}\left[R_{t}\right]}{\partial H_{t}(a)} &=\frac{\partial}{\partial H_{t}(a)}\left[\sum\limits_{x=1}^k \pi_{t}(x) q_{*}(x)\right]\quad (\text{definition of } \mathbb{E}\left[R_t\right]) \\ &=\sum\limits_{x=1}^k q_{*}(x) \frac{\partial \pi_{t}(x)}{\partial H_{t}(a)} \quad(\text{linearity of the partial derivative})\\ &=\sum\limits_{x=1}^k\left(q_{*}(x)-B_{t}\right) \frac{\partial \pi_{t}(x)}{\partial H_{t}(a)} \quad(B_t \text{ is a \textit{baseline} independent of } x)\\ &=\sum\limits_{x=1}^k \pi_{t}(x)\left(q_{*}(x)-B_{t}\right) \frac{\partial \pi_{t}(x)}{\partial H_{t}(a)} / \pi_{t}(x)\quad (\text{multiply by } \pi_t(x)/\pi_t(x)) \\ &=\mathbb{E}\left[\left(q_{*}\left(A_{t}\right)-B_{t}\right) \frac{\partial \pi_{t}\left(A_{t}\right)}{\partial H_{t}(a)} / \pi_{t}\left(A_{t}\right)\right] \quad (\text{expectation over } A_t \sim \pi_t)\\ &=\mathbb{E}\left[\left(R_{t}-\bar{R}_{t}\right) \frac{\partial \pi_{t}\left(A_{t}\right)}{\partial H_{t}(a)} / \pi_{t}\left(A_{t}\right)\right]\quad (q_*(A_t)=\mathbb{E}\left[R_t\mid A_t\right] \text{ and the choice } B_t=\bar{R}_t) \\ &=\mathbb{E}\left[\left(R_{t}-\bar{R}_{t}\right) \pi_{t}\left(A_{t}\right)\left(\mathbb{1}_{a=A_{t}}-\pi_{t}(a)\right) / \pi_{t}\left(A_{t}\right)\right] \quad (\text{quotient rule applied to the soft-max form of } \pi_t(A_t))\\ &=\mathbb{E}\left[\left(R_{t}-\bar{R}_{t}\right)\left(\mathbb{1}_{a=A_{t}}-\pi_{t}(a)\right)\right] \end{aligned} \] The increment used by the gradient bandit algorithm is exactly a sampled version of this expectation.
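Two facts are used implicitly above and are worth spelling out: the soft-max derivative behind the quotient-rule step, and the reason the baseline \(B_t\) can be inserted without changing the sum. Both follow directly from the definition of \(\pi_t\): \[ \frac{\partial \pi_{t}(x)}{\partial H_{t}(a)} =\frac{\partial}{\partial H_{t}(a)} \frac{e^{H_{t}(x)}}{\sum_{b=1}^{k} e^{H_{t}(b)}} =\frac{\mathbb{1}_{a=x}\, e^{H_{t}(x)} \sum\limits_{b=1}^{k} e^{H_{t}(b)}-e^{H_{t}(x)}\, e^{H_{t}(a)}}{\left(\sum\limits_{b=1}^{k} e^{H_{t}(b)}\right)^{2}} =\pi_{t}(x)\left(\mathbb{1}_{a=x}-\pi_{t}(a)\right), \] and summing over \(x\) gives \(\sum\limits_{x=1}^k \frac{\partial \pi_{t}(x)}{\partial H_{t}(a)} = \pi_t(a)-\pi_t(a)\sum\limits_{x=1}^k \pi_t(x) = 0\), so subtracting the constant \(B_t\) from \(q_*(x)\) leaves the gradient unchanged.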

Notes

  1. Watch DeepMind's series of lecture videos alongside this post.
  2. Think about why the baseline is important for the algorithm's convergence.

References

  1. Sutton, R., & Barto, A. (2005). Reinforcement Learning: An Introduction. IEEE Transactions on Neural Networks, 16, 285-286.