Summary

The goal of reinforcement learning is to learn a policy which maximises the reward (over a finite or infinite horizon). Instead of choosing the best action from state-value estimates, why not learn the policy directly? In other words, we want to learn the optimal $\theta$ in $\pi_{\theta} (a\mid s)$.

Hence, Policy-based RL is an optimization problem. The general outline is

  1. Construct the objective $J(\theta)$ to maximise.
  2. Compute the gradient $\nabla_{\theta} J(\theta)$.
  3. Find a form of the gradient which allows stochastic updates.
  4. Find the local maximum by Stochastic Gradient Ascent: $\theta = \theta + \alpha \nabla_\theta J(\theta)$, where $\alpha$ is a step-size (a minimal sketch of this loop follows the list).
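
As a rough sketch of how the four steps fit together (the `estimate_policy_gradient` function below is a hypothetical placeholder for Step 3, and the surrogate objective inside it is made up purely so the loop runs end to end):

```python
import numpy as np

def estimate_policy_gradient(theta):
    # Hypothetical stand-in for Step 3: a real agent would sample trajectories
    # under pi_theta and return a stochastic estimate of grad_theta J(theta).
    # Here we pretend J(theta) = -||theta - 1||^2 so the loop runs end to end.
    return -2.0 * (theta - 1.0)

theta = np.zeros(4)   # policy parameters
alpha = 0.1           # step size

for _ in range(100):  # Step 4: (stochastic) gradient ascent
    theta = theta + alpha * estimate_policy_gradient(theta)

print(theta)          # approaches the maximiser of the stand-in objective
```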

TLDR;

Step 3 requires mathematical tricks because the objective $J(\theta)$ from Step 1 is an expectation over states and actions, i.e., the gradient has the form $\nabla_\theta E_{s, a}[r(s,a)]$, where the distribution over $(s, a)$ itself depends on $\theta$. This is bad because, with the gradient outside the expectation, we cannot estimate it from individual samples and can only approximate it after taking many Monte-Carlo samples of the trajectory.

Whenever the gradient sits outside the expectation, the goal is to move it inside, i.e., to derive $\nabla_\theta E[R] \rightarrow E[\nabla_\theta R]$, so that the gradient can be estimated from sampled states and actions.


The Objective Function $J(\theta)$

The objective function to optimize is the expected reward over trajectories $\tau$ with finite horizon, sampled from the policy with parameters $\theta$, where $p_{\theta} (\tau) = p_{\theta}(s_1, a_1, \cdots, s_T, a_T)$.

\begin{align} J(\theta) &= E_{\tau \sim p_\theta(\tau)} \left[\sum_{t=1}^T r(s_t, a_t)\right] \\ &= \sum_{t=1}^T E_{(s_t, a_t) \sim p_\theta (s_t, a_t)} [r(s_t, a_t)] \end{align}
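
As a minimal sketch, $J(\theta)$ can be approximated by averaging total rewards over sampled trajectories. The `sample_trajectory` function below is a made-up stand-in for rolling out $\pi_\theta$ in an environment:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_trajectory(theta, T=10):
    # Made-up rollout: returns the rewards r(s_t, a_t) of one trajectory of
    # length T sampled under pi_theta. A real implementation would step an
    # environment with actions drawn from pi_theta(a|s).
    return rng.normal(loc=float(np.sum(theta)), scale=1.0, size=T)

def estimate_J(theta, num_trajectories=1000):
    # Monte-Carlo estimate of J(theta) = E_tau[ sum_t r(s_t, a_t) ]
    return np.mean([sample_trajectory(theta).sum() for _ in range(num_trajectories)])

theta = np.array([0.1, -0.2])
print(estimate_J(theta))
```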

The average reward objective, where $\mu_{\pi_\theta}(s)$ is the stationary distribution over states under the policy $\pi_\theta$, is

\begin{align} E_{(s, a) \sim p_\theta (s, a)} [r(s, a)] \
&=\sum_s \mu_{\pi_\theta}(s) \sum_a \pi_\theta (a|s) r(a, s) \end{align}

Note that although this no longer looks like an expectation over trajectories, it is a valid objective because the stationary distribution $\mu_{\pi_\theta}(s)$ captures the long-run proportion of time spent in each state under the policy.
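
For a small tabular problem this objective can be evaluated directly. The $\mu$, $\pi_\theta$ and $r$ arrays below describe a made-up MDP with 3 states and 2 actions:

```python
import numpy as np

mu = np.array([0.5, 0.3, 0.2])   # stationary distribution mu_pi(s) over 3 states
pi = np.array([[0.9, 0.1],       # pi_theta(a|s); each row sums to 1
               [0.4, 0.6],
               [0.2, 0.8]])
r = np.array([[1.0, 0.0],        # reward r(s, a)
              [0.0, 2.0],
              [0.5, 0.5]])

# J(theta) = sum_s mu(s) sum_a pi(a|s) r(s, a): the average reward per step.
J = np.sum(mu[:, None] * pi * r)
print(J)
```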


Compute the gradient $\nabla_{\theta}J(\theta)$

The gradient is obtained by differentiating Eq (3) with respect to $\theta$.

\begin{align} \nabla_{\theta}J(\theta) = \nabla_{\theta}(\sum_s \mu_{\pi_\theta}(s) \sum_a \pi_\theta (a|s) r(a, s)) \end{align}

The policy gradient theorem states that, even though the stationary distribution $\mu_{\pi_\theta}(s)$ itself depends on $\theta$, we can pass the gradient through it without differentiating $\mu$, to get

\[\begin{align} &\nabla_{\theta} (\sum_s \mu (s) \sum_a \pi_\theta (a \mid s) r(a, s)) \\ &=\sum_s \mu (s) \sum_a \nabla_\theta \pi_\theta(a \mid s) r(a, s) \end{align}\]

The inner term $\sum_a \nabla_\theta \pi_\theta (a \mid s) r(a, s)$ is a sum of gradients, one for each action: each gradient $\nabla_\theta \pi_\theta (a \mid s)$ is the direction in which to change the weights to increase the probability of that action, and it is weighted by $r(a, s)$, the value of taking that action.
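
A minimal numerical sketch of Eq (6) for a tabular softmax policy $\pi_\theta(a \mid s) = \mathrm{softmax}(\theta_{s})_a$ is shown below; the toy $\mu$, $\theta$ and $r$ are made up, and $\mu$ is held fixed, matching the right-hand side of Eq (6):

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

mu = np.array([0.5, 0.3, 0.2])   # stationary distribution over 3 states (held fixed)
theta = np.zeros((3, 2))         # one logit per (state, action) pair
r = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [0.5, 0.5]])       # reward r(a, s), indexed here as [state, action]

pi = softmax(theta)              # pi_theta(a|s)

# grad_theta J = sum_s mu(s) sum_a grad_theta pi_theta(a|s) r(a, s).
# For a tabular softmax policy the inner sum works out, per logit theta[s, a], to
# pi(a|s) * (r(a, s) - E_{a'~pi}[r(a', s)]).
expected_r = np.sum(pi * r, axis=1, keepdims=True)
grad_J = mu[:, None] * pi * (r - expected_r)
print(grad_J)                    # direction that shifts probability towards better actions
```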


Computing a form of $\nabla_\theta J(\theta)$ for stochastic updates

Performing a single step of gradient update now looks like this:

\[\begin{align} \theta &= \theta + \alpha \nabla_\theta J(\theta)\\ \theta &= \theta + \alpha (\sum_s \mu (s) \sum_a \nabla_\theta \pi_\theta(a \mid s) r(a, s)) \end{align}\]

This is still expensive to compute because of the sum over states and actions ($\sum_s \mu (s) \sum_a$). Instead we would like to make updates based on the state and action we just visited.


The log-derivative or Score function trick

From Eq (6),

\[\begin{align} \nabla_\theta J(\theta) &= \sum_s \mu (s) \sum_a \nabla_\theta \pi_\theta(a \mid s) r(a, s) \\ &= \sum_s \mu(s) \sum_a \pi_\theta (a \mid s) \frac{\nabla_\theta \pi_\theta(a \mid s)}{\pi_\theta(a \mid s)} r(a, s)\\ &= \sum_s \mu(s) \sum_a \pi_\theta (a \mid s) \nabla_\theta \log \pi_\theta (a \mid s) r(a, s)\\ &= E_{s, a}[\nabla_\theta \log \pi_\theta (a \mid s) r(a, s)] \end{align}\]

Eq (10) uses the multiply by 1 trick.
Eq (11) uses the log-derivative trick $\frac{\nabla_\theta f(\theta)}{f(\theta)}= \nabla_\theta \log f(\theta)$.

The term inside the expectation is now an unbiased single-sample estimate of the gradient, so the stochastic gradient ascent update for a sampled state-action pair $(s, a)$ looks like

\begin{align} \theta = \theta + \alpha \nabla_\theta \log \pi_\theta(a \mid s) r(a, s) \end{align}
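
In code, this single-sample update for a tabular softmax policy might look like the sketch below; the sampled $(s, a, r)$ triple is a placeholder that a real agent would obtain from an environment step:

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

theta = np.zeros((3, 2))   # softmax policy logits for 3 states, 2 actions
alpha = 0.1                # step size

# Placeholder transition: state s, action a sampled from pi_theta(.|s), reward r.
s, a, r = 1, 0, 2.0

# grad_theta log pi_theta(a|s) for a softmax policy: one_hot(a) - pi(.|s),
# non-zero only in the row of logits belonging to state s.
grad_log_pi = np.zeros_like(theta)
grad_log_pi[s] = -softmax(theta[s])
grad_log_pi[s, a] += 1.0

# Stochastic gradient ascent step: theta <- theta + alpha * grad log pi(a|s) * r(a, s).
theta = theta + alpha * grad_log_pi * r
print(theta)
```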


Implications of Converting the Gradient update from $\nabla_\theta \mathbb{E}[R] \rightarrow \mathbb{E} [\nabla_\theta R]$

Although this update is unbiased, it has very high variance due to sparse reward signals and noise in the trajectories, and on its own it does not work well in practice. Suppose we want to evaluate the expected gradient $G=\mathbb{E}[\nabla_\theta R]$. We can estimate $G$ by generating $N$ independent rollouts and averaging their gradients:

\[\tilde{G} = \frac{1}{N} \sum_{n=1}^N \nabla_{\theta}R_n\]

Although $\tilde{G}$ is an unbiased Monte-Carlo estimator of $\mathbb{E}[\nabla_\theta R]$, its error is governed by the variance of the individual gradient samples,

\[Var(\nabla_\theta R) = \mathbb{E}[(\nabla_\theta R-\mathbb{E}[\nabla_\theta R])^2]\]

Note: Recall $Var(X)=\mathbb{E}[(X - \mathbb{E}[X])^2]$

If we wanted to quantify this variance, an empirical estimate from the same $N$ rollouts is

\[\tilde{Var}(\nabla_\theta R) = \frac{1}{N}\sum_{n=1}^N (\nabla_\theta R_n - \tilde{G})^2\]
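
As a sketch, given $N$ per-rollout gradient samples (simulated below as random vectors, since the true $\nabla_\theta R_n$ depend on the environment and policy), $\tilde{G}$ and the empirical variance can be computed as:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for N per-rollout gradient estimates grad_theta R_n (one row each).
N, dim = 100, 4
grad_samples = rng.normal(loc=0.5, scale=2.0, size=(N, dim))

G_tilde = grad_samples.mean(axis=0)                        # Monte-Carlo estimate of E[grad_theta R]
var_tilde = ((grad_samples - G_tilde) ** 2).mean(axis=0)   # empirical per-sample variance

print(G_tilde)
print(var_tilde)   # the variance of G_tilde itself shrinks roughly as var_tilde / N
```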


Variance Reduction Methods

Subtracting a control variate (more colloquially known in the RL world as a baseline term $b$) can reduce the variance of the gradient estimate.

If the control variate does not depend on $\theta$, the expected gradient does not change and the estimate remains unbiased: since $\mathbb{E}[\nabla_\theta b] = 0$, we have $\mathbb{E}[\nabla_\theta(R-b)]=\mathbb{E}[\nabla_\theta R]$.

Hand-wavy claim: if $b$ is a good estimate of the expected reward of being in that state, independent of the action taken, i.e. $b(s)\approx \mathbb{E}[R \mid s]$, then $Var(R-b(s)) < Var(R)$.

Intuitively, by subtracting the average (baseline) reward of being in that state, the remainder $R-b(s)$ has less variation, because most of the variation has hopefully already been explained by $b(s)$.

Formally, whether $Var(R-b(s)) < Var(R)$ depends on the covariance between $R$ and $b(s)$ and on the variance of $b(s)$:

\[Var(R-b(s)) = Var(R) + Var(b(s)) - 2Cov(R, b(s))\]

From here we can see that the variance is reduced whenever $2Cov(R, b(s)) > Var(b(s))$: the more strongly the baseline covaries with $R$ relative to its own variance, the smaller $Var(R-b(s))$ becomes.

Note: $Var(X-Y) = Var(X) + Var(Y) - 2Cov(X, Y)$
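
A small simulation, with made-up returns and a made-up baseline that is correlated with them, illustrates both the variance reduction and the covariance identity:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up returns R: a state-dependent part plus action/trajectory noise.
state_value = rng.normal(size=10_000)             # stands in for E[R | s]
R = state_value + 0.3 * rng.normal(size=10_000)   # observed returns
b = state_value                                   # baseline b(s) ~ E[R | s]

print(np.var(R))       # Var(R)
print(np.var(R - b))   # Var(R - b): much smaller, since b explains most of R
print(np.var(R) + np.var(b) - 2 * np.cov(R, b, bias=True)[0, 1])  # equals Var(R - b)
```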


Variance Reduction Algorithms

Different policy gradient RL algorithms propose ways to deal with the high variance of this gradient without changing the expected value of the gradient. For instance,

  • Actor-critic subtracts a baseline value, where the baseline is estimated by a critic model (see the sketch after this list).
  • PPO uses a clipped objective to ensure policy updates are not too large.
  • TRPO constrains the updates to a ‘trust region’.
  • DDPG uses a replay buffer to store past experiences and a target network to stabilize training.
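
For example, a bare-bones actor-critic update might subtract the critic's value estimate as the baseline. This is only a sketch: one step, no discounting or bootstrapping, and the $(s, a, r)$ transition is a placeholder:

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

theta = np.zeros((3, 2))   # actor: softmax policy logits
V = np.zeros(3)            # critic: state-value estimates, used as the baseline b(s)
alpha_actor, alpha_critic = 0.1, 0.1

# Placeholder transition (s, a, r) that a real agent would collect from the environment.
s, a, r = 1, 0, 2.0

advantage = r - V[s]       # reward minus baseline (a full actor-critic would also
                           # bootstrap with gamma * V[s_next], omitted here)

# Actor update: grad log pi weighted by the advantage instead of the raw reward.
grad_log_pi = -softmax(theta[s])
grad_log_pi[a] += 1.0
theta[s] = theta[s] + alpha_actor * advantage * grad_log_pi

# Critic update: move V(s) towards the observed reward.
V[s] = V[s] + alpha_critic * (r - V[s])
print(theta, V)
```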


References

DeepMind Lecture on Policy Gradients
Berkeley Lecture on Policy Gradients
Shakir Mohamed on the Log-derivative trick
David Meyer on Policy Gradients
NYU Lecture Notes on Monte Carlo Methods (and Control variates)