EM Algorithm for Gaussian mixtures
Key Concepts
- A Gaussian Mixture Model (GMM) is a probabilistic model that assumes all data points are generated from a mixture of a finite number of Gaussian distributions with unknown parameters.
- Expectation Maximization (EM) is an iterative algorithm that assigns data points to components using the current model, updates the model parameters using those assignments, and then re-assigns the data points with the new parameters.
- We use Expectation Maximization to learn the parameters of the GMM in an unsupervised fashion.
Model Preliminaries
Datapoints are generated in an IID fashion from an underlying density $ p(X) $.
$ p(X) $ is defined as a finite mixture model with K components:
\begin{equation} p(x|\theta)=\sum_{k=1}^{K}\alpha_{k}\, p_{k}(x|\theta_{k}) \end{equation}
Assuming our finite mixture model is a Gaussian mixture model, we can model $ p_{k}(x|\theta_k) $ as a Gaussian density with parameters $ \theta_k = \{\mu_k, \Sigma_k\} $:
\begin{equation} p_k(x|\theta_k) = \frac{1}{(2\pi)^{d/2}|\Sigma_k|^{1/2}}\,\exp\!\left(-\frac{1}{2}(x-\mu_k)^t\Sigma_k^{-1}(x-\mu_k)\right) \end{equation}
The $\alpha_k$ are the mixture weights, representing the prior probability of component $k$, i.e. the probability that a randomly selected $x$ was generated by component $k$, where $\sum_{k=1}^{K} \alpha_k = 1 $.
Given a new datapoint, we want to compute the “membership weight”, $w_{ik}$ of the datapoint $x_i$ in cluster $k$, given the parameters $\theta$. Membership weights for a single datapoint should sum to 1.
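To make these definitions concrete, here is a minimal sketch in Python (assuming NumPy; the two-component parameters are made-up, illustrative values) that evaluates the Gaussian density of equation (2) and the mixture density of equation (1) at a single point.

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Multivariate Gaussian density, equation (2)."""
    d = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm

# Toy 2-component mixture in 2 dimensions (illustrative values only).
alphas = np.array([0.6, 0.4])                       # mixture weights, sum to 1
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]  # component means
Sigmas = [np.eye(2), 0.5 * np.eye(2)]               # component covariances

x = np.array([1.0, 1.0])
# Mixture density, equation (1): p(x|theta) = sum_k alpha_k * p_k(x|theta_k)
p_x = sum(a * gaussian_pdf(x, m, S) for a, m, S in zip(alphas, mus, Sigmas))
print(p_x)
```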
Algorithm Preliminaries
- The Expectation Maximization (EM) algorithm is used for parameter estimation when labels are missing from the training examples (partially observed data). It proceeds iteratively, with two steps per iteration: the first step uses the current (initially guessed) parameter values to predict labels; the second step uses the predicted labels from step 1 to re-estimate the parameter values.
- One of the key insights into why semi-supervised or unsupervised learning works is that we can calculate $p(x)$ by marginalizing out the labels! That is, given our estimates, we can check how good they are using only the original data, without labels.
\begin{equation} p(x) = \sum_{y=1}^{K} p(x, y) = \sum_{y=1}^{K} q(x|y)\,q(y) \end{equation}
- Here $q(y)$ is the probability of seeing label $y$, and $q(x|y)$ is the probability of $x$ given label $y$. Under a basic Naive Bayes model we can estimate an initial distribution over the labels $q(y)$ from a small supervised training set (often by counts), or simply guess one.
- As a result, we can construct a likelihood function as a measure of how well our estimated parameter values fit the training examples.
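As a tiny numerical illustration of this marginalization, the sketch below uses made-up probability tables for a discrete $x$ with two labels and computes $p(x) = \sum_y q(x|y)\,q(y)$.

```python
import numpy as np

# Illustrative discrete example: 2 labels, 3 possible values of x.
q_y = np.array([0.7, 0.3])            # q(y): prior over labels, sums to 1
q_x_given_y = np.array([              # q(x|y): rows are labels, columns are x values
    [0.5, 0.3, 0.2],
    [0.1, 0.2, 0.7],
])

# p(x) = sum_y q(x|y) q(y), computed for every value of x at once.
p_x = q_y @ q_x_given_y
print(p_x, p_x.sum())  # p_x sums to 1, as a valid distribution should
```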
Implementing the EM algorithm for Gaussian Mixture Models
The algorithm cycles between the E and M steps, iteratively updating $\theta$ until convergence is detected. Initially, either the membership weights are initialised randomly (so we lead with the M-step) or the model parameters are initialised randomly (so we lead with the E-step).
We will need:
- E-step
- M-step
- EM algorithm (E-step + M-step)
- The log-likelihood function for assessing model convergence
- Progress plotting of the parameters and the log-likelihood
- E-step: Compute the membership weights $w_{ik}$ for all data points, and all mixture components.
- This is proportional to the likelihood of the data point under a particular mixture component, multiplied by the weight of that component: \(p_k(x_i|\theta_k)\,\alpha_k\)
- Here $p_k$ is a density with parameters $\theta_k$. If Gaussian, then $p_k = \mathcal{N}$ and $\theta_k$ consists of the mean $\mu_k$ and covariance $\Sigma_k$.
- We divide by a normalizing constant to ensure all weights sum to 1:
\begin{equation} w_{ik} = \frac{p_k(x_i|\theta_k)\,\alpha_k}{\sum_{m=1}^{K}p_m(x_i|\theta_m)\,\alpha_m} \end{equation}
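A minimal E-step sketch, assuming NumPy/SciPy and that the mixture parameters are stored as arrays; the names `alphas`, `mus`, `Sigmas` are illustrative choices, not fixed by these notes:

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, alphas, mus, Sigmas):
    """Return the N x K matrix of membership weights w_ik (equation (4))."""
    N, K = X.shape[0], len(alphas)
    w = np.zeros((N, K))
    for k in range(K):
        # Unnormalized weight: p_k(x_i | theta_k) * alpha_k for every point i.
        w[:, k] = alphas[k] * multivariate_normal.pdf(X, mean=mus[k], cov=Sigmas[k])
    # Normalize each row so the membership weights of a point sum to 1.
    return w / w.sum(axis=1, keepdims=True)
```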
- M-step: Use the membership weights and the data to calculate new parameter values.
- Calculate the cluster weights \(\alpha_k^{new} = \frac{N_k}{N}, 1 \leq k \leq K\), where \(N_k = \sum_{i=1}^{N} w_{ik}\) is the total membership weight of cluster $k$ over all data points, i.e. how much weight is apportioned to $k$ across the entire dataset.
- Calculate the mean $\mu_k$ and covariance $\Sigma_k$ of each mixture component:
\begin{equation} \mu_k^{new} = \frac{1}{N_k}\sum_{i=1}^{N}w_{ik}\,x_i, \quad 1 \leq k \leq K \end{equation}
\begin{equation} \Sigma_k^{new} = \frac{1}{N_k}\sum_{i=1}^{N}w_{ik}\,(x_i - \mu_k^{new})(x_i - \mu_k^{new})^t \end{equation}
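A matching M-step sketch under the same assumptions (data `X` is an $N \times d$ array, `w` the $N \times K$ weight matrix from the E-step sketch above):

```python
import numpy as np

def m_step(X, w):
    """Given data X (N x d) and membership weights w (N x K), return updated parameters."""
    N, d = X.shape
    K = w.shape[1]
    N_k = w.sum(axis=0)                # total membership weight per component
    alphas = N_k / N                   # new mixture weights
    mus = (w.T @ X) / N_k[:, None]     # new means, equation (5)
    Sigmas = np.zeros((K, d, d))
    for k in range(K):
        diff = X - mus[k]              # deviations from the new mean
        Sigmas[k] = (w[:, k, None] * diff).T @ diff / N_k[k]   # new covariances, equation (6)
    return alphas, mus, Sigmas
```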
- EM-Algorithm: Running parameter updates iteratively
- After computing the model parameters (M-step), we can recompute the membership weights (E-step), which we then use to recompute the model parameters (M-step), and so on.
- As we run the algorithm, we expect the log likelihood to increase.
- The log-likelihood is the (log of the) probability of observing the given data set under the current parameters of the model.
- The log-likelihood is given by \( \log p(\mathcal{D}|\theta) = \sum_{i=1}^{N} \log p(x_i|\theta) \).
- In a Gaussian mixture model this becomes \( \sum_{i=1}^{N} \log \sum_{k=1}^{K} \alpha_k\, p_k(x_i|\theta_k) \), with each \(p_k\) the Gaussian density of equation (2).
- Run EM until convergence: repeat until there is no significant change (increase) in the log-likelihood between iterations, or the number of iterations exceeds a maximum.
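Putting the pieces together, a minimal sketch of the convergence loop. It assumes the `e_step` and `m_step` sketches above are in scope; the tolerance and iteration cap are illustrative choices:

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_likelihood(X, alphas, mus, Sigmas):
    """Sum over points of log( sum_k alpha_k * p_k(x_i|theta_k) )."""
    mix = sum(a * multivariate_normal.pdf(X, mean=m, cov=S)
              for a, m, S in zip(alphas, mus, Sigmas))
    return np.sum(np.log(mix))

def run_em(X, alphas, mus, Sigmas, max_iters=100, tol=1e-6):
    """Alternate E- and M-steps until the log-likelihood stops improving."""
    history = [log_likelihood(X, alphas, mus, Sigmas)]
    for _ in range(max_iters):
        w = e_step(X, alphas, mus, Sigmas)      # E-step: membership weights
        alphas, mus, Sigmas = m_step(X, w)      # M-step: new parameters
        history.append(log_likelihood(X, alphas, mus, Sigmas))
        if history[-1] - history[-2] < tol:     # no significant increase
            break
    return alphas, mus, Sigmas, history
```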
Strengths of the Algorithm
- EM with GMM is an extension of K-means. Whereas k-means makes hard assignments of data points to cluster centres, in EM-GMM a data point has soft “membership weights”, i.e. uncertainty over which cluster it belongs to.
Relation to K-means
- K-means is a limiting case of EM-GMM: as the (shared, spherical) cluster variances shrink toward zero, the membership weights become hard assignments that depend solely on the distance to the nearest cluster center.
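A small sketch of this limiting behaviour: with shrinking spherical covariances, the membership weights of equation (4) collapse to a hard assignment (values are illustrative, assuming NumPy/SciPy):

```python
import numpy as np
from scipy.stats import multivariate_normal

mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]  # two cluster centers
x = np.array([1.0, 1.0])                            # a point closer to the first center

for sigma2 in [1.0, 0.1, 0.01]:                     # shrinking spherical variance
    dens = np.array([multivariate_normal.pdf(x, mean=m, cov=sigma2 * np.eye(2))
                     for m in mus])
    w = dens / dens.sum()                           # membership weights (equal alphas)
    print(sigma2, w)                                # -> approaches a hard [1, 0] assignment
```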
Weaknesses of the Algorithm
- EM is prone to converging to a local optimum. In practice, run EM with multiple random initializations of the cluster centers and keep the best solution (see the sketch after this list).
- We need to specify the number of clusters and initialize the cluster means.
- Cluster centers can be selected with k-means or k-means++.
- EM is prone to overfitting in high dimensions.
- A full-covariance GMM has on the order of $ D^2 \times K $ parameters, where $D$ is the number of dimensions and $K$ the number of clusters.
- To reduce the number of parameters, we can use diagonal covariance matrices, i.e. the off-diagonal elements are zero and each component only keeps a per-dimension variance.
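A sketch of EM with multiple random restarts, keeping the run with the best final log-likelihood. It assumes the `run_em` sketch above; the initialization scheme and restart count are illustrative:

```python
import numpy as np

def em_with_restarts(X, K, n_restarts=10, seed=0):
    """Run EM from several random initializations; keep the best log-likelihood."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    best = None
    for _ in range(n_restarts):
        # Random initialization: uniform weights, means drawn from the data,
        # identity covariances (k-means / k-means++ seeding also works here).
        alphas = np.full(K, 1.0 / K)
        mus = X[rng.choice(N, size=K, replace=False)]
        Sigmas = np.array([np.eye(d) for _ in range(K)])
        params = run_em(X, alphas, mus, Sigmas)
        if best is None or params[-1][-1] > best[-1][-1]:   # compare final log-likelihoods
            best = params
    return best
```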