Summary

The goal of reinforcement learning is to learn a policy that maximises the expected reward (over a finite or infinite horizon). Instead of choosing the best action from a state-value estimate, why not learn the policy directly? In other words, we want to learn the optimal $\theta$ in $\pi_{\theta} (a\mid s)$.

Hence, policy-based RL is an optimization problem. The general outline is:

  1. Construct the objective $J(\theta)$ to maximise.
  2. Compute the gradient $\nabla_{\theta} J(\theta)$.
  3. Find a form of the gradient which allows stochastic updates.
  4. Find a local maximum by stochastic gradient ascent; $\theta = \theta + \alpha \nabla_\theta J(\theta)$, where $\alpha$ is a step size (a toy sketch of this loop follows the list).
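
As a rough preview of the whole recipe, here is a minimal sketch on a two-armed bandit with a softmax policy, using the per-sample gradient estimator that the rest of this post derives; the bandit, the parameter layout, and all names are illustrative assumptions, not a reference implementation.

```python
# A minimal sketch of the four steps on a 2-armed bandit with a softmax policy
# pi_theta(a) = softmax(theta)[a]; the setup and names are illustrative.
import numpy as np

rng = np.random.default_rng(0)
rewards = np.array([1.0, 2.0])   # r(a): fixed reward for each arm
theta = np.zeros(2)              # policy parameters
alpha = 0.1                      # step size

def pi(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

for step in range(500):
    p = pi(theta)
    # Step 1: J(theta) = E_{a ~ pi_theta}[r(a)], estimated by sampling an action.
    a = rng.choice(2, p=p)
    # Steps 2-3: an unbiased per-sample form of grad J (derived later in the post):
    # grad_theta log pi(a) * r(a), where grad_theta log pi(a) = onehot(a) - pi.
    grad_log_pi = -p
    grad_log_pi[a] += 1.0
    grad_J_hat = grad_log_pi * rewards[a]
    # Step 4: stochastic gradient ascent.
    theta = theta + alpha * grad_J_hat

print("learned policy:", pi(theta))  # typically concentrates on the higher-reward arm
```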

TL;DR

Step 3 requires mathematical tricks because the objective $J(\theta)$ from Step 1 is an expectation over states and actions, i.e. what we need is the gradient of an expectation, $\nabla_\theta E_{s, a}[r(s,a)]$. This is bad because the distribution being averaged over depends on $\theta$ itself, so we can only estimate the gradient after taking many Monte Carlo samples of whole trajectories.

Whenever the gradient sits outside the expectation, the goal is to move it inside, i.e. to rewrite $\nabla_\theta E[R]$ as an expectation of something we can compute from a single sample, $E[\nabla_\theta \log \pi_\theta \cdot R]$.


The Objective Function $J(\theta)$

The objective function to optimize is the expected reward over finite-horizon trajectories $\tau$ sampled from the policy with parameters $\theta$, where $p_{\theta} (\tau) = p_{\theta}(s_1, a_1, \cdots, s_T, a_T)$.

\begin{align} J(\theta) &= E_{\tau \sim p_\theta(\tau)} \left[\sum_t r(s_t, a_t)\right] \\ &= \sum_{t=1}^T E_{(s_t, a_t) \sim p_\theta (s_t, a_t)} [r(s_t, a_t)] \end{align}
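
To make the expectation concrete, here is a small sketch that estimates $J(\theta)$ by averaging the summed reward over sampled trajectories; the two-state MDP, tabular softmax policy, and all names are assumptions made up for illustration.

```python
# A toy finite-horizon MDP with a tabular softmax policy; all sizes and names
# are illustrative. J(theta) is estimated by averaging sampled trajectory returns.
import numpy as np

rng = np.random.default_rng(1)
S, A, T = 2, 2, 5                            # states, actions, horizon
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] = distribution over next states
R = rng.normal(size=(S, A))                  # r(s, a)
theta = np.zeros((S, A))                     # tabular softmax parameters

def pi(theta, s):
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

def sample_return(theta):
    """Roll out one trajectory tau ~ p_theta(tau) and return sum_t r(s_t, a_t)."""
    s, total = rng.integers(S), 0.0
    for _ in range(T):
        a = rng.choice(A, p=pi(theta, s))
        total += R[s, a]
        s = rng.choice(S, p=P[s, a])
    return total

# Monte Carlo estimate of J(theta) = E_{tau ~ p_theta(tau)}[sum_t r(s_t, a_t)]
J_hat = np.mean([sample_return(theta) for _ in range(5000)])
print("estimated J(theta):", J_hat)
```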

The average reward, where $\mu_{\pi_\theta}(s)$ is the stationary distribution over states under the policy $\pi_\theta$, is

\begin{align} E_{(s, a) \sim p_\theta (s, a)} [r(s, a)] = \sum_s \mu_{\pi_\theta}(s) \sum_a \pi_\theta (a \mid s)\, r(s, a) \end{align}
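
For a toy tabular MDP (again, a made-up setup purely for illustration), Eq (3) can be evaluated exactly once the stationary distribution of the Markov chain induced by the policy is known, for example by power iteration:

```python
# Exact average reward for a toy tabular setup (illustrative sizes and names):
# sum_s mu(s) sum_a pi(a|s) r(s, a), with mu the stationary distribution of the
# Markov chain that the policy induces over states.
import numpy as np

rng = np.random.default_rng(2)
S, A = 3, 2
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] = distribution over next states
R = rng.normal(size=(S, A))                  # r(s, a)
theta = rng.normal(size=(S, A))

def policy(theta):                           # pi_theta(a | s), shape (S, A)
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

pi = policy(theta)

# Induced chain over states: M[s, s'] = sum_a pi(a|s) P[s, a, s']
M = np.einsum("sa,sat->st", pi, P)

# Stationary distribution mu = mu M, found here by power iteration.
mu = np.ones(S) / S
for _ in range(1000):
    mu = mu @ M
mu /= mu.sum()

avg_reward = np.sum(mu[:, None] * pi * R)    # sum_s mu(s) sum_a pi(a|s) r(s, a)
print("stationary distribution:", mu)
print("average reward:", avg_reward)
```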


Compute the gradient $\nabla_{\theta}J(\theta)$

The gradient at a single time step can be written by taking the gradient of Eq (3):

\begin{align} \nabla_{\theta}J(\theta) = \nabla_{\theta}\left(\sum_s \mu_{\pi_\theta}(s) \sum_a \pi_\theta (a \mid s)\, r(s, a)\right) \end{align}

The policy gradient theorem states that we can pass the gradient through the stationary distribution over states (i.e. we do not need $\nabla_\theta \mu_{\pi_\theta}(s)$) to get

\begin{align} &\nabla_{\theta} \left(\sum_s \mu (s) \sum_a \pi_\theta (a \mid s)\, r(s, a)\right) \\ &=\sum_s \mu (s) \sum_a \nabla_\theta \pi_\theta(a \mid s)\, r(s, a) \end{align}

The term $\sum_a \nabla_\theta \pi_\theta (a \mid s)\, r(s, a)$ is a sum, over actions, of the gradient of each action's probability under the policy; each gradient is the direction in which to change the weights so as to increase the probability of that action, and it is weighted by $r(s, a)$, the value of that action.
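
Here is a sketch of Eq (6) for a tabular softmax policy, computing the full double sum exactly; the toy sizes are assumptions, and `mu` stands in for the stationary distribution (which could be computed as in the earlier sketch).

```python
# Exact evaluation of Eq (6) for a tabular softmax policy (illustrative setup).
# The softmax Jacobian gives d pi(a|s) / d theta[s, b] = pi(a|s) (1[a == b] - pi(b|s)),
# and theta[s', :] does not affect pi(.|s) for s' != s.
import numpy as np

rng = np.random.default_rng(3)
S, A = 3, 2
R = rng.normal(size=(S, A))       # r(s, a)
mu = rng.dirichlet(np.ones(S))    # stand-in for the stationary distribution
theta = rng.normal(size=(S, A))

def policy(theta):
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

pi = policy(theta)

# grad_J[s, b] = mu(s) * sum_a d pi(a|s)/d theta[s, b] * r(s, a)
grad_J = np.zeros((S, A))
for s in range(S):
    for b in range(A):
        dpi = pi[s] * ((np.arange(A) == b) - pi[s, b])
        grad_J[s, b] = mu[s] * np.sum(dpi * R[s])

print(grad_J)
```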


Computing a form of $\nabla_\theta J(\theta)$ for stochastic updates

Performing a single gradient update step now looks like this:

\begin{align} \theta &= \theta + \alpha \nabla_\theta J(\theta)\\ \theta &= \theta + \alpha \left(\sum_s \mu (s) \sum_a \nabla_\theta \pi_\theta(a \mid s)\, r(s, a)\right) \end{align}

This is still expensive to compute because of the sum over states and actions ($\sum_s \mu (s) \sum_a$). Instead, we would like to make updates based only on the state and action we just visited.


The log-derivative (or score function) trick

From Eq (6),

\begin{align} \nabla_\theta J(\theta) &= \sum_s \mu (s) \sum_a \nabla_\theta \pi_\theta(a \mid s)\, r(s, a) \\ &= \sum_s \mu(s) \sum_a \pi_\theta (a \mid s) \frac{\nabla_\theta \pi_\theta(a \mid s)}{\pi_\theta(a \mid s)}\, r(s, a)\\ &= \sum_s \mu(s) \sum_a \pi_\theta (a \mid s) \nabla_\theta \log \pi_\theta (a \mid s)\, r(s, a)\\ &= E_{s, a}[\nabla_\theta \log \pi_\theta (a \mid s)\, r(s, a)] \end{align}

Eq (10) uses the multiply-by-1 trick (multiplying and dividing by $\pi_\theta(a \mid s)$).
Eq (11) uses the log-derivative trick $\frac{\nabla_\theta f(\theta)}{f(\theta)}= \nabla_\theta \log f(\theta)$.
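
A quick numerical check of this identity in the simplest possible setting (a single state with a softmax policy over a handful of actions, made up for illustration): the exact sum $\sum_a \nabla_\theta \pi_\theta(a)\, r(a)$ against the Monte Carlo average of $\nabla_\theta \log \pi_\theta(a)\, r(a)$ for sampled actions.

```python
# Numerical sanity check of the score-function trick in the simplest setting:
# one state, softmax over a few actions (illustrative setup). The exact gradient
# sum_a grad_theta pi(a) r(a) should match the Monte Carlo average of
# grad_theta log pi(a) r(a) with a ~ pi, up to sampling noise.
import numpy as np

rng = np.random.default_rng(4)
A = 4
r = rng.normal(size=A)            # r(a), single-state case
theta = rng.normal(size=A)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

pi = softmax(theta)

# Exact gradient via the softmax Jacobian.
exact = np.zeros(A)
for b in range(A):
    dpi = pi * ((np.arange(A) == b) - pi[b])
    exact[b] = np.sum(dpi * r)

# Score-function estimate: grad_theta log pi(a) = onehot(a) - pi.
samples = rng.choice(A, size=200_000, p=pi)
grad_log_pi = np.eye(A)[samples] - pi
estimate = np.mean(grad_log_pi * r[samples][:, None], axis=0)

print("exact:   ", exact)
print("estimate:", estimate)
```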

A single sample $\nabla_\theta \log \pi_\theta (a \mid s)\, r(s, a)$ is now an unbiased estimate of the gradient, and the stochastic gradient ascent update becomes

\begin{align} \theta = \theta + \alpha \nabla_\theta \log \pi_\theta(a \mid s)\, r(s, a) \end{align}
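
A sketch of this stochastic update on the same kind of toy tabular MDP used earlier (an illustrative setup, not a reference implementation); note that, exactly as written above, it weights the log-probability gradient by the immediate reward $r(s, a)$.

```python
# A sketch of the stochastic update above on a toy tabular MDP (illustrative setup).
# Each visited (s, a) triggers one update, weighted by the immediate reward r(s, a);
# for tabular softmax, grad_theta[s] log pi(a|s) = onehot(a) - pi(.|s).
import numpy as np

rng = np.random.default_rng(5)
S, A = 3, 2
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] = distribution over next states
R = rng.normal(size=(S, A))                  # r(s, a)
theta = np.zeros((S, A))
alpha = 0.05

def pi(theta, s):
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

s = rng.integers(S)
for step in range(20_000):
    p = pi(theta, s)
    a = rng.choice(A, p=p)
    # theta <- theta + alpha * grad_theta log pi(a|s) * r(s, a)
    grad_log = -p
    grad_log[a] += 1.0
    theta[s] += alpha * grad_log * R[s, a]
    s = rng.choice(S, p=P[s, a])             # continue along the trajectory

# In each state the policy drifts towards the action with higher immediate reward.
print(np.round([pi(theta, s) for s in range(S)], 3))
```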


Implications of Converting the Gradient from $\nabla_\theta E[R]$ to $E[\nabla_\theta \log \pi_\theta \cdot R]$

Although this update is unbiased, it has very high variance and does not work well in practice on its own. Different policy gradient algorithms propose ways to deal with the high variance of this gradient. For instance,

  • Actor-critic methods subtract a baseline value, which is estimated with a critic model (a small baseline sketch follows this list).
  • PPO uses a clipped objective to ensure policy updates are not too large.
  • TRPO constrains the updates to a ‘trust region’.
  • DDPG uses a replay buffer to store past experiences and a target network to stabilize training.
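
As a small illustration of the baseline idea from the first bullet, here is a sketch (single state, softmax policy; an assumed toy setup) comparing the variance of the plain estimator $\nabla_\theta \log \pi_\theta(a)\, r(a)$ with the baseline-subtracted one $\nabla_\theta \log \pi_\theta(a)\,(r(a) - b)$, using the mean reward as the baseline $b$.

```python
# Variance reduction by subtracting a baseline, in the single-state softmax
# setting (illustrative). Subtracting b from the reward leaves the estimator
# unbiased (E[grad log pi * b] = 0) but can shrink its variance a lot.
import numpy as np

rng = np.random.default_rng(6)
A = 4
r = 10.0 + rng.normal(size=A)     # r(a): the large offset inflates variance
theta = rng.normal(size=A)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

pi = softmax(theta)
samples = rng.choice(A, size=100_000, p=pi)
grad_log_pi = np.eye(A)[samples] - pi        # one row per sampled action

baseline = np.sum(pi * r)                    # b = E[r(a)]; a critic would estimate this
plain   = grad_log_pi * r[samples][:, None]
centred = grad_log_pi * (r[samples] - baseline)[:, None]

print("mean (plain):   ", plain.mean(axis=0))    # the means agree up to sampling noise
print("mean (centred): ", centred.mean(axis=0))
print("variance (plain):   ", plain.var(axis=0).sum())
print("variance (centred): ", centred.var(axis=0).sum())   # much smaller
```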


References

DeepMind Lecture on Policy Gradients
Berkeley Lecture on Policy Gradients
Shakir Mohamed on the Log-derivative Trick
David Meyer on Policy Gradients