EM Algorithm for Gaussian mixtures
Key Concepts
- A Gaussian Mixture Model (GMM) is a probabilistic model that assumes all data points are generated from a mixture of a finite number of Gaussian distributions with unknown parameters.
- Expectation Maximization (EM) is an iterative algorithm that assigns data points to components using the current model, updates the model parameters using those assignments, and then re-assigns the data points with the new parameters.
- We use Expectation Maximization to learn the parameters of the GMM in an unsupervised fashion.
Model Preliminaries
Datapoints are generated in an IID fashion from an underlying density $ p(X) $.
$ p(X) $ is defined as a finite mixture model with K components:
\begin{equation} p(x|\theta)=\sum_{k=1}^{K}\alpha_{k}\, p_{k}(x|\theta_{k}) \end{equation}
Assuming our finite mixture model is a Gaussian mixture model, we can model $ p_{k}(x|\theta_k) $ as a Gaussian density with parameters $ \theta_k = \{\mu_k, \Sigma_k\} $:
\begin{equation} p_k(x|\theta_k) = \frac{1}{(2\pi)^{d/2}|\Sigma_k|^{1/2}}\,\exp\!\left(-\frac{1}{2}(x-\mu_k)^t\Sigma_k^{-1}(x-\mu_k)\right) \end{equation}
The $\alpha_k$ are the mixture weights, representing the prior probability of component $k$, i.e. the probability that a randomly selected $x$ was generated by component $k$, where $\sum_{k=1}^{K} \alpha_k = 1 $.
Given a new datapoint, we want to compute the “membership weight”, $w_{ik}$ of the datapoint $x_i$ in cluster $k$, given the parameters $\theta$. Membership weights for a single datapoint should sum to 1.
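To make these definitions concrete, here is a minimal sketch in Python (assuming NumPy; the two-component parameters are made-up, illustrative values) that evaluates the Gaussian density of equation (2) and the mixture density of equation (1) at a single point.

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Multivariate Gaussian density, equation (2)."""
    d = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm

# Toy 2-component mixture in 2 dimensions (illustrative values only).
alphas = np.array([0.6, 0.4])                       # mixture weights, sum to 1
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]  # component means
Sigmas = [np.eye(2), 0.5 * np.eye(2)]               # component covariances

x = np.array([1.0, 1.0])
# Mixture density, equation (1): p(x|theta) = sum_k alpha_k * p_k(x|theta_k)
p_x = sum(a * gaussian_pdf(x, m, S) for a, m, S in zip(alphas, mus, Sigmas))
print(p_x)
```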
Algorithm Preliminaries
- The Expectation Maximization (EM) algorithm is used for parameter estimation when labels are missing from the training examples (partially observed data). It proceeds iteratively, with two steps per iteration: the first step uses the current (initially guessed) parameter values to predict labels; the second step uses the predicted labels from step 1 to re-estimate the parameter values.
- One of the key insights into why semi-supervised or unsupervised learning works is that we can calculate $p(x)$ by marginalizing out the labels! That is, given our estimates, we can check how good they are using only the original data, without labels.
\begin{equation} p(x) = \sum_{y=1}^{K} p(x, y) = \sum_{y=1}^{K} q(x|y)\,q(y) \end{equation}
- Here $q(y)$ is the probability of seeing label $y$, and $q(x|y)$ is the probability of $x$ given label $y$. Under a basic Naive Bayes model we can estimate an initial distribution over the labels $q(y)$ from a small supervised training set (often by counts), or simply guess one.
- As a result, we can construct a likelihood function as a measure of how well our estimated parameter values fit the training examples.
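As a tiny numerical illustration of this marginalization, the sketch below uses made-up probability tables for a discrete $x$ with two labels and computes $p(x) = \sum_y q(x|y)\,q(y)$.

```python
import numpy as np

# Illustrative discrete example: 2 labels, 3 possible values of x.
q_y = np.array([0.7, 0.3])            # q(y): prior over labels, sums to 1
q_x_given_y = np.array([              # q(x|y): rows are labels, columns are x values
    [0.5, 0.3, 0.2],
    [0.1, 0.2, 0.7],
])

# p(x) = sum_y q(x|y) q(y), computed for every value of x at once.
p_x = q_y @ q_x_given_y
print(p_x, p_x.sum())  # p_x sums to 1, as a valid distribution should
```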
Implementing the EM algorithm for Gaussian Mixture Models
The algorithm cycles between the E and M steps, iteratively updating $\theta$ until convergence is detected. Initially, either the membership weights are initialised randomly (so we lead with the M-step) or the model parameters are initialised randomly (so we lead with the E-step).
We will need:
- E-step
- M-step
- EM algorithm (E-step + M-step)
- The log-likelihood function for assessing model convergence
- Progress plotting of the parameters and the log-likelihood
- E-step: Compute the membership weights $w_{ik}$ for all data points, and all mixture components.
- This is proportional to the likelihood of the data point under a particular mixture component, multiplied by the weight of that component: \(p_k(x_i|\theta_k)\,\alpha_k\)
- Here $p_k$ is a density with parameters $\theta_k$. If Gaussian, then $p_k = \mathcal{N}$ and $\theta_k$ consists of the mean $\mu_k$ and covariance $\Sigma_k$.
- We divide by a normalizing constant to ensure all weights sum to 1:
\begin{equation} w_{ik} = \frac{p_k(x_i|\theta_k)\,\alpha_k}{\sum_{m=1}^{K}p_m(x_i|\theta_m)\,\alpha_m} \end{equation}
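A minimal E-step sketch, assuming NumPy/SciPy and that the mixture parameters are stored as arrays; the names `alphas`, `mus`, `Sigmas` are illustrative choices, not fixed by these notes:

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, alphas, mus, Sigmas):
    """Return the N x K matrix of membership weights w_ik (equation (4))."""
    N, K = X.shape[0], len(alphas)
    w = np.zeros((N, K))
    for k in range(K):
        # Unnormalized weight: p_k(x_i | theta_k) * alpha_k for every point i.
        w[:, k] = alphas[k] * multivariate_normal.pdf(X, mean=mus[k], cov=Sigmas[k])
    # Normalize each row so the membership weights of a point sum to 1.
    return w / w.sum(axis=1, keepdims=True)
```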
- M-step: Use the membership weights and the data to calculate new parameter values.
- Calculate the cluster weights \(\alpha_k^{new} = \frac{N_k}{N}, 1 \leq k \leq K\), where \(N_k = \sum_{i=1}^{N} w_{ik}\) is the total membership weight of cluster $k$ over all data points, i.e. how much weight is apportioned to $k$ across the entire dataset.
- Calculate the mean $\mu_k$ and covariance $\Sigma_k$ of each mixture component:
\begin{equation} \mu_k^{new} = \frac{1}{N_k}\sum_{i=1}^{N}w_{ik}\,x_i, \quad 1 \leq k \leq K \end{equation}
\begin{equation} \Sigma_k^{new} = \frac{1}{N_k}\sum_{i=1}^{N}w_{ik}\,(x_i - \mu_k^{new})(x_i - \mu_k^{new})^t \end{equation}
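A matching M-step sketch under the same assumptions (data `X` is an $N \times d$ array, `w` the $N \times K$ weight matrix from the E-step sketch above):

```python
import numpy as np

def m_step(X, w):
    """Given data X (N x d) and membership weights w (N x K), return updated parameters."""
    N, d = X.shape
    K = w.shape[1]
    N_k = w.sum(axis=0)                # total membership weight per component
    alphas = N_k / N                   # new mixture weights
    mus = (w.T @ X) / N_k[:, None]     # new means, equation (5)
    Sigmas = np.zeros((K, d, d))
    for k in range(K):
        diff = X - mus[k]              # deviations from the new mean
        Sigmas[k] = (w[:, k, None] * diff).T @ diff / N_k[k]   # new covariances, equation (6)
    return alphas, mus, Sigmas
```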
- EM-Algorithm: Running parameter updates iteratively
- After computing the model parameters (M-step), we can recompute the membership weights (E-step), which we then use to recompute the model parameters (M-step), and so on.
- As we run the algorithm, we expect the log likelihood to increase.
- The log-likelihood is the (log of the) probability of observing the given data set under the current parameters of the model.
- The log-likelihood is given by \( \log p(\mathcal{D}|\theta) = \sum_{i=1}^{N} \log p(x_i|\theta) \).
- In a Gaussian mixture model this becomes \( \sum_{i=1}^{N} \log \sum_{k=1}^{K} \alpha_k\, p_k(x_i|\theta_k) \), with each \(p_k\) the Gaussian density of equation (2).
- Run EM until convergence: repeat until there is no significant change (increase) in the log-likelihood between iterations, or the number of iterations exceeds a maximum.
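Putting the pieces together, a minimal sketch of the convergence loop. It assumes the `e_step` and `m_step` sketches above are in scope; the tolerance and iteration cap are illustrative choices:

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_likelihood(X, alphas, mus, Sigmas):
    """Sum over points of log( sum_k alpha_k * p_k(x_i|theta_k) )."""
    mix = sum(a * multivariate_normal.pdf(X, mean=m, cov=S)
              for a, m, S in zip(alphas, mus, Sigmas))
    return np.sum(np.log(mix))

def run_em(X, alphas, mus, Sigmas, max_iters=100, tol=1e-6):
    """Alternate E- and M-steps until the log-likelihood stops improving."""
    history = [log_likelihood(X, alphas, mus, Sigmas)]
    for _ in range(max_iters):
        w = e_step(X, alphas, mus, Sigmas)      # E-step: membership weights
        alphas, mus, Sigmas = m_step(X, w)      # M-step: new parameters
        history.append(log_likelihood(X, alphas, mus, Sigmas))
        if history[-1] - history[-2] < tol:     # no significant increase
            break
    return alphas, mus, Sigmas, history
```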
Strengths of the Algorithm
- EM with GMM is an extension of K-means. Whereas k-means makes hard assignments of data points to cluster centres, in EM-GMM a data point has soft “membership weights”, i.e. uncertainty over which cluster it belongs to.
Relation to K-means
- K-means is a limiting case of EM-GMM: as the (shared, spherical) cluster variances shrink toward zero, the membership weights become hard assignments that depend solely on the distance to the nearest cluster center.
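A small sketch of this limiting behaviour: with shrinking spherical covariances, the membership weights of equation (4) collapse to a hard assignment (values are illustrative, assuming NumPy/SciPy):

```python
import numpy as np
from scipy.stats import multivariate_normal

mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]  # two cluster centers
x = np.array([1.0, 1.0])                            # a point closer to the first center

for sigma2 in [1.0, 0.1, 0.01]:                     # shrinking spherical variance
    dens = np.array([multivariate_normal.pdf(x, mean=m, cov=sigma2 * np.eye(2))
                     for m in mus])
    w = dens / dens.sum()                           # membership weights (equal alphas)
    print(sigma2, w)                                # -> approaches a hard [1, 0] assignment
```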
Weaknesses of the Algorithm
- EM is prone to converging to a local optimum. In practice, run EM with multiple random initializations of the cluster centers and keep the best solution (see the sketch after this list).
- We need to specify the number of clusters and initialize the cluster means.
- Cluster centers can be selected with k-means or k-means++.
- EM is prone to overfitting in high dimensions.
- A full-covariance GMM has on the order of $ D^2 \times K $ parameters, where $D$ is the number of dimensions and $K$ the number of clusters.
- To reduce the number of parameters, we can use diagonal covariance matrices, i.e. the off-diagonal elements are zero and each component only keeps a per-dimension variance.
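A sketch of EM with multiple random restarts, keeping the run with the best final log-likelihood. It assumes the `run_em` sketch above; the initialization scheme and restart count are illustrative:

```python
import numpy as np

def em_with_restarts(X, K, n_restarts=10, seed=0):
    """Run EM from several random initializations; keep the best log-likelihood."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    best = None
    for _ in range(n_restarts):
        # Random initialization: uniform weights, means drawn from the data,
        # identity covariances (k-means / k-means++ seeding also works here).
        alphas = np.full(K, 1.0 / K)
        mus = X[rng.choice(N, size=K, replace=False)]
        Sigmas = np.array([np.eye(d) for _ in range(K)])
        params = run_em(X, alphas, mus, Sigmas)
        if best is None or params[-1][-1] > best[-1][-1]:   # compare final log-likelihoods
            best = params
    return best
```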