Gibbs Sampling on Dirichlet Multinomial Naive Bayes (Text)
Key Concepts

Gibbs Sampling is a type of MCMC method, which allows us to obtained in samples from probability distributions, without having to explicitly calculate the values for their marginalizing integrals. The key idea (amongst the other MCMC methods) is to sample each unknown variable in turn, condition on the value of all other variables in the model.

The Dirichlet Multinomial model is a frequently encountered model in Bayesian Statistics. Here a prior $\theta$ is drawn from a dirichlet distribution, with parameters $\gamma_\theta$, $\theta \sim Dir(\gamma_\theta)$, and the observed values are drawn from a Multinomial distribution with parameters $\theta$ and trials $n$, $V \sim Multinomial(\theta, n)$.

The Dirichlet distribution is a conjugate distribution to the Multinomial distribution, which has useful properties in the context of Gibbs Sampling.
Model Preliminaries
In Bayesian Inference, the aim is to infer the posterior probability distribution over a set of random variables. However this involves computing integrals which are mostly intractable.
MCMC methods are a way to approximate the value of the integral, by sampling from the joint distribution. (Hence, we never need to calculate explicitly the value of the integral). The idea is to sample from a state space of $Z$ such that the likelihood of sampling $z$ is proportional to $p(z)$. The Markov Chain indicates that the next state, $t+1$, that we visit depends only on the current state, $t$. That is,
\begin{equation} P_{trans}(z^{t+1}z^{t}) = P_{trans}(z^{t+1}z^t, z^{t1}, …, z^1) \end{equation}
MCMC methods try to approximate $P_{trans}(z^{t+1}z^{t})$ such that the probability of visiting $z$ is proportional to $p(z)$. Values of variables in the model can be approximated from simple statistics. Read more about MCMC [here].
Gibbs Sampling is one type of MCMC method, that applies when $z$ has $k$ dimensions, $z=<z_1, z_2, …, z_k>$ . Each dimension corresponds to a parameter or variable in our model. The algorithm samples from the conditional distribution by updating $z$ one dimension at a time. The probabilistic walk samples each $k$ dimension, condition on the other $k1$ dimensions. In this process, new values for the variables are used as soon as they are obtained. For each iteration $t$ in $1..T$, sample each $i$ in $1..k$, such that:
\begin{equation} z_i^{t+1} \sim P(z_i^{t+1}z_1^{t+1}, …, z_{i1}^{t+1}, z_{i+1}^{t},…, z_k^{t}) \end{equation}
In Probabilistic Graphical models, we seek to infer(through Bayesian inference) probability distributions over latent variables which are parameters in the model. To apply Gibbs Sampling, we need to first obtain the mathematical equations for the conditional updates. The general workflow is as follows:
 Assume a generative model which allows us to observe the data generated by some latent variables. The model is often constructed with domain knowledge and results in structured joint distributions.
 Draw the probabilistic graphical model using plate diagram
 Draw the probabilistic graphical model using plate diagram
 Factorise the joint distribution of the model to conditional distribution based on the model structure.
 Express conditional distributions in terms of mathematical equations.
 Express conditional distributions in terms of mathematical equations.
 Sample variables by running the Gibbs Sampling algorithm.
 Obtain equations for conditional update of variables.
 Obtain equations for conditional update of variables.
 Obtain approximate values of the variables via sample statistics.
Implementing Gibbs Sampling on Dirichlet Multinomial Naive Bayes Model
The rest of this document follows the Dirichlet Multinomial Naive Bayes (DMNB) Model from “Gibbs Sampling for the Uninitiated”. The model described in the paper has documents as the item of interest. Each document, $doc_j$, can be represented as a bag of words, $W_j$. The observed features are the words in the document, with the kth word in the jth document represented as $W_{jk}$. The goal is to predict a sentiment label $L_j$ for each of the documents, after observing $W_j$. By Bayes Rule,
\begin{equation} L_j = argmax_LP(LW_j) = argmax_L\frac{P(W_jL)P(L)}{P(W_j)} \end{equation}
Then, $L_j = argmax_LP(LW_j, \pi, \gamma_\pi, \theta, \gamma_\theta)$.
1. Assume a generative model of text
Assume a generative model expressed by the following directed graphical model. Sampling from a directed graphical model involves sampling from each node condition on its parents. For top level varibles, this involves sampling from its marginal or providing estimates directly.
 L is a 2class label, which can be sampled from a Bernoulli distribution with parameter $\pi$ where $P(L_j=1)$. For more than 2 classes, L can be sampled from a multinomial distribution.
\begin{equation} L_j \sim Bernoulli(\pi) \end{equation}
 $\pi$, the parameter of the Bernoulli distribution, is sampled from a conjugate prior probability  the Beta distribution which has hyperparameter $\gamma_\pi = [\alpha, \beta]$. Without any previous information, we use an uninformative prior, $\gamma_\pi = [1, 1]$, which returns a uniform distribution where any value of $\pi$ is equally likely.
\begin{equation} \pi \sim Beta(\gamma_\pi) \end{equation}
 We assumes a Naive model of word generation, where the words generated are independent of each other given L. The probability distribution that the words are sampled from depends on whether $L_j$ is 0 or 1. If $L_j=0$, the probability distribution we sample from is $w_0 \sim Multinomial(\theta_0)$. Generally,
\begin{equation} w_j \sim Multinomial(\theta_{L_j}) \end{equation}
 $\theta_{L_j}$, the parameter of the Multinomial distribution is sampled from a prior probability, the Dirichlet distribution. The Dirichlet distribution is the Beta distribution generalized to more than two possible values. The dirichlet distribution is parameterised by hyperparameter $\gamma_\theta$. The number of dimensions is the number of words in the vocabulary. $\gamma_\theta = [\gamma_{\theta_1}, \gamma_{\theta_2},…,\gamma_{\theta_v}]$
\begin{equation} \theta \sim Dirichlet(\gamma_\theta) \end{equation}
2. Factorise the joint distribution
The joint distribution of the model above is $P(C, L, \pi, \theta_0, \theta_1, \gamma_\pi, \gamma_\theta)$ where $C$ represents the entire document collection. $C_0$ and $C_1$ are the set of documents with label 0 and 1 respectively. The independence assumptions based on the graphical model shown in Fig1. give the following conditional distribution:
\begin{equation} P(C_0L, \theta_0)P(C_1L, \theta_1)P(\theta_0\gamma_\theta)P(\theta_1\gamma_\theta)P(L\pi)P(\pi\gamma_\pi) \end{equation}
We need to convert this into it’s mathematical form so that we can perform updates for Gibbs Sampling subsequently. The state space of variables to update is $L, \pi, \theta_0, \theta_1$.
 $P(C_xL, \theta_x)$ is the probability of observing documents(and its contents) with label $x\in \lbrace0, 1\rbrace$, given label sequence $L$ and $\theta_x$. Since the documents are generated independently of each other, we can multiply the probability of each document, $D_m$, condition on the document’s label$L_m$ and the multinomialword distribution, $\theta_{L_m}$ specific to the value of the document’s label.
Assuming a Naive Bayes model, the probability of observing the document’s bagofwords, $w_1,..w_V$, can be obtained from multiplying each word’s probability (document specific), $\theta_{L_m(w_i)}$.
\begin{equation} P(C_xL, \theta_x) = \prod_{m\/in C_x}{}P(D_mL, \theta_{L_m}) \end{equation} \begin{equation} = \prod_{m\in C_x}.\prod_{i=1}^{V_m}\theta_{L_m(wi)}^{count(wi)} = \prod_{i=1}^{V}\theta_{L_m(w_i)}^{count(w_i \in C_x)} \end{equation}
 $P(\theta_x\gamma_\theta)$ is the probability of the multinomial word distribution, drawn from a dirichlet prior with hyperparameter $\gamma_\theta$. By definition,
\begin{equation} P(\theta_{L_m}\gamma_\theta) = c.\prod_{i=1}^{V}\theta_{L_m(i)}^{\gamma_{\theta_i}1} \end{equation}
 $P(L\pi)$ is the probability of obtaining a specific sequence L of N binary labels, given that the probability of choosing label=1 is $\pi$.
\begin{equation} P(L\pi) = \pi^{C_1}(1\pi)^{C_0} \end{equation}
 $P(\pi\gamma_\pi)$ is the probability of drawing $L$ given the prior Beta distribution with hyperparameters $\gamma_\pi=[\alpha,\beta]$. By definition of the Beta distribution, where c is a normalization constant:
\begin{align}
P(\pi\gamma_\pi) = c.\pi^{\alpha1}(1\pi)^{\beta1} \nonumber
\\
P(\pi\gamma_\pi) = \frac{\Gamma(\alpha+ \beta)}{\Gamma(\alpha)\Gamma(\beta)}.\pi^{\alpha1}(1\pi)^{\beta1}
\end{align}
The factorised joint distribution is thus,
\begin{equation} c.\prod_{i=1}^{V}\theta_{1(w_i)}^{count(w_i\in C_1)+\gamma_{\theta_i}1}\prod_{i=1}^{V}\theta_{0(w_i)}^{count(w_i\in C_0)+\gamma_{\theta_i}1}.\pi^{C_1+\alpha1}(1\pi)^{C_0+\beta1} \end{equation}
2.1 Integrating out $\pi$
We would like to simplify the problem by reducing the number of variables to sample. To do this, we can integrate out $\pi$ from the joint distribution, which has the effect of taking all possible values of $\pi$ into account without representing it explicitly at every iteration. That is,
\begin{align} P(C, L, \theta_0, \theta_1, \gamma_\theta, \gamma_\pi) = \int_{\pi}P(C, L, \pi,\theta_0, \theta_1, \gamma_\theta, \gamma_\pi)d\pi \nonumber \\ = \int_{\pi}P(L, \theta_0)P(L, \theta_1)P(\theta_0\gamma_\theta)P(\theta_1\gamma_\theta)P(L\pi)P(\pi\gamma_\pi)d\pi \nonumber \\ =P(L, \theta_0)P(L, \theta_1)P(\theta_0 \gamma_\theta)P(\theta_1\gamma_\theta)\int_{\pi}P(\pi)P(\pi\gamma_\pi)d\pi \end{align}
We can get rid of the integral by substituting the true distributions for $P(\pi)$ from eqn(12) and $P(\pi\gamma_\pi)$ from eqn(13).
\begin{align} \int_{\pi}P(\pi).P(\gamma_\pi)d\pi = \int_{\pi}\pi^{C_1}(1\pi)^{C_0}.\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\pi^{\alpha1}(1\pi)^{\beta1}d\pi \nonumber \\ = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}.\int_{\pi}\pi^{C_1+\alpha1}(1\pi)^{C_0+\beta1}d\pi \end{align} Observe that the second part of this eqn is the beta distribution.
\begin{equation} = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}.\int_{\pi}Beta(C_1+\alpha,C_0+\beta)d\pi \end{equation}
The value of the integral of the Beta distribution is given by the normalizing constant. Hence,
\begin{equation} = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}.\frac{\Gamma(C_1+\alpha)\Gamma(C_0+\beta)}{\Gamma(C_0+C_1+\alpha+\beta)} \end{equation}
Substituting eqn (18) back into the joint distribution gives us,
\begin{equation} c.\prod_{i=1}^{V}\theta_{1(w_i)}^{count(w_i\in C_1)+\gamma_{\theta_i}1}\theta_{0(w_i)}^{count(w_i\in C_0)+\gamma_{\theta_i}1}.\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}.\frac{\Gamma(C_1+\alpha)\Gamma(C_0+\beta)}{\Gamma(C_0+C_1+\alpha+\beta)} \end{equation}
3. Obtain equations for conditional update of variables
In this model (see plate diagram)  the document words, $W_{jk}$ are observed, and $\gamma_\pi, \gamma_\theta$ are hyperparameters of the model. $\pi, L, \theta_0, \theta_1$ are parameters of the model that need to be sampled. Having integrated out $\pi$, we are left with:
 $L$ is the binary label variable which must be sampled for each document
 $\theta_0, \theta_1$ is the probability of drawing a word from a multinomial distribution if the label of the document is 0 or 1 respectively. It is a vector with length $V$, where $V$ is the vocabulary.
To assign the value of $L_2^{(t+1)}$, we need to sample from the conditional distribution of \begin{equation} P(L_2L_1^{(t+1)},L_3^{(t)},..,L_N^{(t)}, C, \theta_0^{(t)}, \theta_1^{(t)}; \gamma_\theta, \gamma_\pi) \end{equation}
To assign the value of $\theta_0^{(t+1)}$, we need to sample from the conditional distribution of \begin{equation} P(\theta_0L_1^{(t+1)}, .., L_N^{(t+1)}, C, \theta_1^{(t)}, \gamma_\theta, \gamma_\pi) \end{equation}
To help us with all the counts, we create document and corpus classes that can be easily reused for texxt models related to probabilistic MCMC updates.
3.1 Conditional update of $L_j$
For each document in the corpus, we need to draw the update from a single Bernoulli trial. In order to do that, we need to get the relative probabilities of $P(L_j=x ..)$ for $x \in \lbrace 0,1 \rbrace$, as the parameter for the Binomial distribution.
\begin{equation} val0 = P(L_j=0L_{(j)}, C_{(j)}, \theta_0, \theta_1, \gamma_\pi, \gamma_\theta) = \frac{C_0+\gamma_{\pi_0}1}{C_0 + C_1 + \gamma_{\pi_1}+ \gamma_{\pi_0} 1}.\prod_{i=1}^{V}\theta_{0}^{W_{ji}} \end{equation}
\begin{equation} val1 = P(L_j=1L_{(j)}, C_{(j)}, \theta_0, \theta_1, \gamma_\pi, \gamma_\theta) = \frac{C_1+\gamma_{\pi_1}1}{C_0 + C_1 + \gamma_{\pi_1} + \gamma_{\pi_0} 1}.\prod_{i=1}^{V}\theta_{1}^{W_{ji}} \end{equation}
\begin{equation} L_j \sim Binomial(\frac{val0}{val0+val1}) \end{equation}
3.2 Conditional update of $\theta_0$ and $\theta_1$
For $\theta_0$ and $\theta_1$, we also need to sample new distributions condition on all other variables. However since we used conjugate priors, the posterior, like the prior, is a dirichlet distribution. The use of conjugate priors simplifies the update as we can draw a new $\theta_x$, where $x \in \lbrace 0, 1\rbrace$ by adding pseudocounts to the hyperparameter of the dirichlet prior:
\begin{equation} \theta_x \sim Dirichlet([(count(w_i\in C_x) + \gamma_{\theta_x(i)}), … , (count(w_v\in C_x)+\gamma_{\theta_x(v)})]) \end{equation}
4. Sample variables by running the Gibbs Sampling algorithm
4.1 Initialization
The state space of which we need to sample includes
 $\pi$, sampled once from a beta distribution and reused for each draw of $L$.
 $L$, binary label variables, one for each document,
 $\theta_0, \theta_1$, sampled once from a dirichlet distribution
4.2 Algorithm
There are two forloops for the variable update.
The inner loop predicts a new label for each document $j=1..N$, we
 remove counts associated with document j,
 assign a new label to it from Section 3.2
 add counts associated with the new label.
For the outer loop, iterations $t=1…T$,
 we draw a new distribution for $\theta_0$ and $\theta_1$.
Note that: once a new label is assigned, it changes the counts that affect the labeling of the subsequent documents.
Calculating Sample Statistics
According to MCMC theory, an estimate of the desired variable can be obtained by averaging across all samples obtained so far. Optionally, there is also
 Running independent Gibbs sequences and averaging the final value of each sequence.
 Discarding burnin samples
 Autocorrelation and lag
 Variations on Gibbs sampling aim to reduce autocorrelation
 Sampling the last value  Under reasonably general conditions, the distribution of the samples drawn converges to the through distribution. Thus for a large enough number of iterations, the final observation is effectively a sample point from the true distribution.
References
Resnik, P., & Hardisty, E. (2010). Gibbs sampling for the uninitiated