Stochastic Gradient Langevin Dynamics (SGLD)1 tweaks the Stochastic Gradient Descent machinery into an MCMC sampler by adding random noise. The idea is to use Langevin dynamics as the proposal distribution of an MCMC sampler, so that the state space is explored efficiently; i.e., the optimisation trajectory itself is treated as an MCMC chain.


Preliminaries/Motivation

  • Gradient Descent gives us the MAP point estimate (the posterior mode) of the parameters $\theta$ of a model, given data $X$: $\theta_{MAP} = \arg\max_{\theta} p(\theta | X)$. The problem with this $\theta_{MAP}$ estimate is that it tells us nothing about what the posterior distribution $p(\theta|X)$ looks like, i.e. we have no way to account for the uncertainty in the parameters $\theta$.

  • Markov Chain Monte Carlo (MCMC) is a class of methods for approximating a complicated posterior distribution. The algorithms work by constructing a Markov Chain whose stationary distribution is the target posterior. This means that if we take transition steps in the state space according to the MCMC algorithm, we eventually converge towards the posterior distribution (a minimal sketch of such a sampler follows this list).

  • Bayesian Deep Learning is a class of methods which gives us a distribution over the parameters rather than a point estimate. This is hard because the posterior distribution is very high-dimensional and complex. In the context of Neural Network parameter optimisation, a particular configuration of the NN parameters $\theta$ corresponds to a single state in the space of all possible NN parameter configurations. That’s an incredibly large state space! Computationally, our existing MCMC tools cannot efficiently explore such a high-dimensional space (the Markov Chain mixes slowly). The key to a practical algorithm is the transition rule, as this determines how we explore the state space.
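
To make the MCMC bullet concrete, here is a minimal sketch of a random-walk Metropolis–Hastings sampler for an unnormalised target density. Everything here (the `log_target` interface, the Gaussian proposal width, the toy target) is an illustrative assumption of mine, not something taken from the SGLD paper.

```python
import numpy as np

def metropolis_hastings(log_target, theta0, n_steps=10_000, prop_std=0.5, rng=None):
    """Random-walk Metropolis-Hastings: propose a local move, then accept/reject
    so that the chain's stationary distribution is the target."""
    rng = np.random.default_rng() if rng is None else rng
    theta = np.asarray(theta0, dtype=float)
    samples = []
    for _ in range(n_steps):
        proposal = theta + prop_std * rng.standard_normal(theta.shape)
        # Accept with probability min(1, target(proposal) / target(theta))
        log_accept = log_target(proposal) - log_target(theta)
        if np.log(rng.uniform()) < log_accept:
            theta = proposal
        samples.append(theta.copy())
    return np.array(samples)

# Toy example: sample from an unnormalised 1D standard Gaussian "posterior".
samples = metropolis_hastings(lambda t: -0.5 * np.sum(t**2), theta0=np.zeros(1))
```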


Background on Langevin Dynamics

The Langevin equation is a stochastic differential equation describing how a system evolves when subject to a combination (expressed as a sum) of deterministic and random forces. The equation originally described the dynamics of particles suspended in a fluid. The change in particle momentum $P_t$ at time $t$ is given by:

\begin{equation} dP_t = -\alpha P_t\,dt - \nabla U(X_t)\,dt + \sigma(X_t)\,dB_t \end{equation}

  • $-\alpha P_t\,dt$ is a friction (damping) term, where $\alpha$ is a constant coefficient related to the particle’s mass and the viscosity of the fluid.
  • $-\nabla U(X_t)\,dt$ is the force from the particle interaction potential $U$, which drives the deterministic evolution of the system. In statistical mechanics, $U$ plays the role of an energy function which the system tries to minimise.
  • $\sigma(X_t)\,dB_t$ is a noise term representing the effect of particle collisions, where $(B_t)_{t\geq 0}$ denotes standard Brownian motion and $dB_t$ its increments. Note that the strength of the collisions depends on the current state via $\sigma(X_t)$.

For simplicity, we work in the overdamped (high-friction) limit, where the momentum dynamics can be neglected. This gives the overdamped Langevin equation for the particle positions $X$ of the system. \begin{equation} dX_t = - \nabla U(X_t)\,dt + \sigma(X_t)\,dB_t \end{equation}
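
As a sanity check on the claim in the next section, here is a minimal sketch that simulates this overdamped equation with the same simple discretisation (Euler–Maruyama) used later in the post. The double-well potential, step size, and number of steps are illustrative choices of mine.

```python
import numpy as np

def simulate_overdamped_langevin(grad_U, x0, sigma=np.sqrt(2.0), dt=1e-3,
                                 n_steps=200_000, rng=None):
    """Euler-Maruyama discretisation of dX_t = -grad U(X_t) dt + sigma dB_t."""
    rng = np.random.default_rng() if rng is None else rng
    x = float(x0)
    traj = np.empty(n_steps)
    for t in range(n_steps):
        # Brownian increment dB_t ~ N(0, dt), i.e. sqrt(dt) times a standard normal
        x = x - grad_U(x) * dt + sigma * np.sqrt(dt) * rng.standard_normal()
        traj[t] = x
    return traj

# Toy potential U(x) = x^4/4 - x^2/2 (double well); with sigma = sqrt(2) the
# trajectory's histogram should approach the Gibbs density proportional to exp(-U(x)).
traj = simulate_overdamped_langevin(grad_U=lambda x: x**3 - x, x0=0.0)
```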


Why Langevin Dynamics can be used as the proposal distribution in MCMC

The distribution of the diffusion $X_t$ converges to a Gibbs/Boltzmann distribution $\pi(X) \propto \exp(-U(X)/T)$,2 which concentrates its mass around the global minimum of $U(X)$, so sampling from that distribution is closely related to minimising $U$.3 Therefore, in the context of optimising model parameters $\theta$, we can simply take $U$ to be a loss function $\mathcal{L}$ of $\theta$ with respect to the data $Y$ (assume temperature $T=1$), and simplify $\sigma(\theta_t)$ to a constant $\sigma$.

\begin{equation} d\theta_t = -\nabla \mathcal{L}(\theta_t, Y)\,dt + \sigma\, dB_t \end{equation}

Sampling with this Langevin dynamics as the proposal distribution therefore converges to a posterior distribution that is consistent with our model’s learning objective.
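
A hedged aside of my own (not spelled out in this form in the paper): the noise scale $\sigma$ plays the role of a temperature. For the constant-$\sigma$ diffusion in eq 3, a standard Fokker–Planck argument gives the stationary density \begin{equation*} \pi(\theta) \propto \exp\!\left(-\frac{2\,\mathcal{L}(\theta, Y)}{\sigma^{2}}\right), \end{equation*} so choosing $\sigma = \sqrt{2}$ recovers the $T=1$ case $\pi(\theta) \propto \exp(-\mathcal{L}(\theta, Y))$, which is the Bayesian posterior when $\mathcal{L}$ is taken to be the negative log joint density of $(\theta, Y)$.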


Steps towards a computationally efficient algorithm

1. Discretising the Langevin Equation

In order to take transition steps in the state space, we need to discretise the continuous-time process. The Euler–Maruyama method approximates $d\theta_t$ with the finite difference $\theta_{t+1}-\theta_t$, and $dB_t$ is discretised using the increments of the Wiener process, $B_t - B_s \sim \mathcal{N}(0, t-s)$ for $s < t$.

\begin{equation} \theta_{t+1} - \theta_t = -\nabla \mathcal{L}(\theta_t, Y)dt + \sigma \mathcal{N}(0, dt) \end{equation}

In the context of optimisation, $dt$ is the step size, or learning rate, $\lambda$; therefore we have

\begin{equation} \theta_{t+1} - \theta_t = - \lambda \nabla \mathcal{L}(\theta_t, Y) + \sigma \mathcal{N}(0, \lambda) \end{equation}
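
In code, this is a single gradient-descent step plus Gaussian noise of variance $\lambda$ (often called the unadjusted Langevin algorithm). A minimal sketch, assuming a full-batch gradient function `grad_loss` and numpy parameters; the $\sigma = \sqrt{2}$ default reflects the $T = 1$ aside above.

```python
import numpy as np

def langevin_update(theta, grad_loss, lam, sigma=np.sqrt(2.0), rng=None):
    """One step of eq 5: a gradient step plus N(0, lam) noise scaled by sigma."""
    rng = np.random.default_rng() if rng is None else rng
    noise = sigma * np.sqrt(lam) * rng.standard_normal(theta.shape)
    return theta - lam * grad_loss(theta) + noise
```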

2. Stochastic Gradient approximation

Computing $\nabla \mathcal{L}$ requires an evaluation over the entire dataset. A standard tool in Deep Learning optimisation is stochastic (minibatch) gradient descent, which in expectation recovers the true gradient when batches are sampled uniformly at random. Here $Y_i$ is a single data point, $N$ is the size of the full dataset, and $N_b$ is the size of the batch:

\begin{equation} \theta_{t+1} - \theta_t = - \lambda\, \nabla\Big(\frac{N}{N_b} \sum_{i=1}^{N_b}\mathcal{L}(\theta_t, Y_i)\Big) + \sigma\, \mathcal{N}(0, \lambda) \end{equation}
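
Putting the two modifications together, a minimal SGLD sketch might look like the following. The per-example gradient interface `grad_loss_single`, the numpy data array `Y`, and the fixed step size are my own illustrative assumptions (the paper’s decreasing step-size schedule is discussed next).

```python
import numpy as np

def sgld(grad_loss_single, theta0, Y, lam=1e-4, sigma=np.sqrt(2.0),
         n_steps=10_000, batch_size=32, rng=None):
    """SGLD as in eq 6: minibatch gradient rescaled by N / N_b, plus Gaussian
    noise with variance lam, at every step. Y is a numpy array of data points."""
    rng = np.random.default_rng() if rng is None else rng
    theta = np.asarray(theta0, dtype=float)
    N = len(Y)
    samples = []
    for _ in range(n_steps):
        batch = Y[rng.choice(N, size=batch_size, replace=False)]
        # Unbiased estimate of the full-data gradient: (N / N_b) * sum over the batch
        grad = (N / batch_size) * sum(grad_loss_single(theta, y) for y in batch)
        theta = theta - lam * grad + sigma * np.sqrt(lam) * rng.standard_normal(theta.shape)
        samples.append(theta.copy())
    return np.array(samples)
```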

3. Always accepting from the proposal distribution

The Langevin equation (eq 3) converges to the desired posterior distribution, but the discrete approximation of the continuous process in eq 5 introduces discretisation error. To correct for this error, we can treat eq 5 as a proposal distribution and apply a Metropolis-Hastings style accept-reject step.

This accept-reject step is itself rather expensive to evaluate. The authors conveniently note that with a decreasing step-size schedule $\lambda_t$ satisfying $\sum_{t=1}^{\infty}\lambda_t = \infty$ and $\sum_{t=1}^{\infty} \lambda_t^2 < \infty$, we have $\lambda_t \rightarrow 0$, and the acceptance probability goes to 1 asymptotically because the discretisation error vanishes. Thus they claim that we can skip the Metropolis-Hastings accept-reject step and simply accept every proposal.
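
To make the step-size condition concrete (a hedged aside; the constants below are chosen purely for illustration): a polynomially decaying schedule $\lambda_t = a(b+t)^{-\gamma}$ with $\gamma \in (0.5, 1]$ satisfies $\sum_t \lambda_t = \infty$ while $\sum_t \lambda_t^2 < \infty$.

```python
def step_size(t, a=1e-2, b=10.0, gamma=0.55):
    """Polynomial decay: the sum of lam_t diverges while the sum of lam_t^2
    converges whenever 0.5 < gamma <= 1."""
    return a * (b + t) ** (-gamma)
```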

As $\lambda \rightarrow 0$, the algorithm transitions from optimisation to sampling

A cute observation about eq 6 is that $\lambda$ enters two noise terms: the minibatch-gradient noise and the injected random noise. In the initial phase, the gradient terms dominate the injected noise and the algorithm behaves like a stochastic gradient descent algorithm. Later, as $\lambda$ shrinks, the injected noise dominates the gradient noise and the algorithm transitions from optimisation to sampling. The authors argue that the optimisation phase corresponds to the “burn-in” phase of MCMC, where we don’t really care about collecting samples, we just want to find our way to the region of high posterior probability.
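
A back-of-envelope restatement of my own: in eq 6 the deterministic drift and the minibatch-gradient noise both enter multiplied by $\lambda$, so their magnitudes shrink like $\lambda$, whereas the injected noise $\sigma\,\mathcal{N}(0,\lambda)$ has standard deviation $\sigma\sqrt{\lambda}$. Their ratio \begin{equation*} \frac{\sigma\sqrt{\lambda}}{\lambda\,\lVert\nabla\mathcal{L}\rVert} = \frac{\sigma}{\sqrt{\lambda}\,\lVert\nabla\mathcal{L}\rVert} \longrightarrow \infty \quad \text{as } \lambda \to 0, \end{equation*} which is why, for small enough step sizes, the injected Gaussian noise eventually dominates and the update behaves like a Langevin sampler rather than SGD.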


The algorithm is done but the theory is only beginning

SGLD is just Gradient Descent with Noise?

The equation above looks like something one could have handwaved: add random noise to Gradient Descent to introduce “uncertainty” into the parameters. After all, noise has been shown to be good for generalisation. In NN parameter learning, we often add noise (dropout) to regularise the model, and SGD itself adds minibatch gradient noise which empirically helps generalisation.4 Max Welling also said that MCMC adds precisely the correct amount of noise (in the Bayesian sense), which I don’t fully understand.5

Why mention MCMC and proposal distributions only to justify that we can ignore them

MCMC gives theoretical guarantees on convergence towards the posterior distribution; the more closely one can show that eq 6 is a “true” MCMC algorithm, the better the convergence guarantees we can have. Some of the approximations that make this only an approximate MCMC method include6

  • approximating the full gradient with a minibatch gradient
  • discretisation of the continuous Langevin dynamics
  • the changing step sizes, which mean the Markov Chain is no longer time-homogeneous

Thus we welcome the wave of ICML/NeurIPS papers with mathematical proofs of non-asymptotic behaviour, the number of iterations needed to get within an epsilon neighbourhood of the target posterior, and much more.


Other Comments

  • The other wave of papers is in applications, because of how easy it is to just add random noise to your optimisation algorithm and declare it Bayesian(!), without caring too much about convergence guarantees.

  • This method appears to be derived in a principled way. However, I am not sure that adding a noise term really does anything in terms of understanding the posterior distribution over parameters. I think we just end up in some more generalisable region of parameter space, where adding noise doesn’t throw us out of the loss basin completely. If we were going down a narrow loss basin, the noise might throw us out entirely, and we would continue optimising somewhere else.

  • The discussion of the learning rate in the paper, and the automatic trade-off between optimisation and sampling, sounded slightly patchy to me. The same goes for the asymptotic argument for not needing an accept-reject step. The two seem to be at odds: one argues the non-asymptotic case and the other the asymptotic case, and somehow the proposed algorithm always wins.

  • Finally, to find new stochastic-gradient MCMC algorithms, the recipe is to find a continuous Markov process with the right stationary distribution. Some very cool work has been done to generalise algorithms of this sort.7


References