Key Concepts

  • Markov-Chain Monte-Carlo is a class of algorithms, which seek to use sampling over a distribution, to provide a principled way to approximate the value of an integral. For Machine Learning, we need to approximate integrals when doing probability estimation over continuous-valued parameters of the model.

  • Monte-Carlo are algorithms that rely on repeated sampling(draws from a probability distribution) to obtain numerical estimates.

  • Markov-Chain refers to the sequence of variables that the sampling process moves through, that has the “Markov property”, which is dependence only on the previous state.

Model Preliminaries

In statistical learning, we are often uncertain about the parameters $ \theta $ of our model. Under a Bayesian framework, we can learn the parameters $\theta$ of our model using probabilistic inference from data/observations. Ultimately, we want $ p(y|X) $, but instead of getting a single hard estimate on $ \theta $, we want to take the distribution $ p(\theta | x)$ into account.

\begin{equation} p(y|X) = \int p(y|\theta, X)p(\theta|X) d\theta \end{equation}

Then, by sampling parameters of the model $\theta$ from the distribution $P(\theta|X)$, we can approximate integrating over $\theta$ for $P(y|\theta, X)P(\theta|X)$, and estimating $y$ by calculating the empirical average.

\begin{equation} \approx\frac{1}{S}\sum_{s=1}^{S}P(y|\theta^{(s)}, X), \theta^{(s)} \sim P(\theta|X) \end{equation}

Inference of $ p(\theta|X) $ is generally untractable in real world problems. By Bayes Rule,

\begin{equation} p(\theta|X) = \frac{p(X|\theta)p(\theta)}{\int p(X|\theta)p(\theta)d(\theta)} \end{equation}

Where the integral in the denominator is intractable. Hence we rely on sampling from a distribution that asymptotically follows $ p(\theta|X, y)$ without having to explicitly calculate the integrals. Why does sampling work?

If we sample N points of $\theta$ at random from the probability density $ p(\theta|X) $, then

\begin{equation} \mathbb{E}[p(y|\theta)]=\mathbb{E}[f(\theta)] = \lim_{N\rightarrow\infty}\frac{1}{N}\sum_{t=1}^{N}f(\theta^t) \end{equation}

Monte Carlo simulations (without Markov Chain), a random sampling approach, are less effective in higher dimensions because it does not gradually improve on good distributions that it has stumbled upon. For Markov Chain methods, it makes small changes from the existing distribution, and proposes small changes instead of sampling independently. The key idea is to sample $\theta$ proportional to $p(\theta)$, by transitioning between states (the “chain”). We sample by transitioning between parameter values with a transition probability. That is, $\theta^{(t+1)}:=g(\theta^{(t)})$, where $g$ is a transition function. “Markov” means that the next transition only depends on the previous state, i.e.

\begin{equation} g(\theta^{(t+1)}|\theta^{(t)}) \end{equation}

Different types of Markov Chain Monte Carlo algorithms are different ways to sample with transition $g$, such that the likelihood is proportional to the true distribution.