### Key Concepts

• A Gaussian Mixture Model (GMM) is a probabilistic model that assumes all the data points are generated from a mixture of a finite number of Gaussian distributions with unknown parameters.
• Expectation Maximization is an iterative algorithm that classifies data points with the model, updates the model parameters using the newly classified data, and then re-classifies the data points with the new parameters.
• We use Expectation Maximization to learn the parameters of the GMM in an unsupervised fashion.

### Model Preliminaries

Datapoints are generated in an IID fashion from an underlying density $p(X)$.

$p(X)$ is defined as a finite mixture model with K components:

\begin{equation} p(x|\theta)=\sum_{k=1}^{K}\alpha_{k}.p_{k}(x|\theta_{k}) \end{equation}

Assuming our finite mixture model is a Gaussian mixture model, we can model $p_{k}(x|\theta_k)$ as a Gaussian density with parameters $\theta_k = \{\mu_k, \Sigma_k\}$:

\begin{equation} p_k(x|\theta_k) = \frac{1}{(2\pi)^{d/2}|\Sigma_k|^{1/2}}\exp\left(-\frac{1}{2}(x-\mu_k)^t\Sigma_k^{-1}(x-\mu_k)\right) \end{equation}
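
As a sanity check on this density formula, here is a minimal sketch (NumPy/SciPy; `gaussian_pdf` is an illustrative name) that evaluates it directly and compares the result against `scipy.stats.multivariate_normal`:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_pdf(x, mu, sigma):
    """Evaluate the multivariate Gaussian density at x, straight from the formula."""
    d = mu.shape[0]
    diff = x - mu
    norm_const = 1.0 / ((2 * np.pi) ** (d / 2) * np.linalg.det(sigma) ** 0.5)
    exponent = -0.5 * diff @ np.linalg.inv(sigma) @ diff
    return norm_const * np.exp(exponent)

# Made-up parameters for illustration.
mu = np.array([0.0, 1.0])
sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
x = np.array([0.5, 0.5])
assert np.isclose(gaussian_pdf(x, mu, sigma),
                  multivariate_normal(mu, sigma).pdf(x))
```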

$\alpha_k$ are the mixture weights, representing the underlying probability of component $k$, i.e. the probability that a randomly selected $x$ was generated by component $k$. They satisfy $\sum_{k=1}^{K} \alpha_k = 1$.

Given a new datapoint, we want to compute the “membership weight”, $w_{ik}$ of the datapoint $x_i$ in cluster $k$, given the parameters $\theta$. Membership weights for a single datapoint should sum to 1.
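
To make the generative story concrete, here is a small sketch (all parameter values are made up) that draws IID samples from a two-component GMM: first pick a component $k$ with probability $\alpha_k$, then draw from that component's Gaussian.

```python
import numpy as np

rng = np.random.default_rng(0)

# Mixture parameters: weights alpha_k, means mu_k, covariances Sigma_k.
alphas = np.array([0.6, 0.4])
mus = [np.array([0.0, 0.0]), np.array([4.0, 4.0])]
sigmas = [np.eye(2), np.array([[1.0, 0.5], [0.5, 1.0]])]

def sample_gmm(n):
    """Draw n IID points: choose component k ~ alpha, then x ~ N(mu_k, Sigma_k)."""
    ks = rng.choice(len(alphas), size=n, p=alphas)
    return np.array([rng.multivariate_normal(mus[k], sigmas[k]) for k in ks])

X = sample_gmm(500)  # shape (500, 2)
```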

### Algorithm Preliminaries

• The Expectation Maximization (EM) algorithm is used for parameter estimation in cases where labels are missing from training examples (partially observed). It proceeds iteratively, where each iteration has two steps. The first step is to use the current parameter values (initialised randomly at the start) to predict labels. The second step is to use the predicted labels from step 1 to estimate new parameter values.

• One of the key insights into why semi-supervised or unsupervised learning works is that we can calculate $p(x)$ by marginalizing out the labels! That is, given our estimates, we can check how good they are based on just the original data, without labels.

\begin{equation} p(x) = \sum_{y=1}^{k} p(x, y) = \sum_{y=1}^k \; q(x|y).q(y) \end{equation}

• Here $q(y)$ is the probability of seeing label $y$, and $q(x|y)$ is the probability of observing $x$ given label $y$. Under a basic Naive Bayes model we can estimate an initial distribution over the labels $q(y)$ from a small supervised training set if one exists (often by counts), or guess one.

• As a result, we can construct a likelihood function as a measure of how well our estimated parameter values fit the training examples.
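
A tiny numerical illustration of this marginalisation, with made-up values for $q(y)$ and $q(x|y)$ over a discrete $x$:

```python
# Hypothetical discrete example: two labels y in {0, 1}, x in {0, 1, 2}.
q_y = [0.7, 0.3]                      # q(y): prior over labels
q_x_given_y = [[0.5, 0.3, 0.2],       # q(x|y=0)
               [0.1, 0.2, 0.7]]       # q(x|y=1)

# p(x) = sum_y q(x|y) q(y) -- no labels needed to evaluate it.
p_x = [sum(q_x_given_y[y][x] * q_y[y] for y in range(2)) for x in range(3)]
print(p_x)  # [0.38, 0.27, 0.35], which sums to 1
```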

### Implementing the EM algorithm for Gaussian Mixture Models

The algorithm cycles between an E-step and an M-step, iteratively updating $\theta$ until convergence is detected. To start, either the membership weights are initialised randomly (leading with the M-step) or the model parameters are initialised randomly (leading with the E-step).

We will need:

1. E-step
2. M-step
3. EM algorithm (E-step + M-step)
4. The log likelihood function for assessing model convergence
5. Progress plotting of parameters and log likelihood
• E-step: Compute the membership weights $w_{ik}$ for all data points, and all mixture components.
• This is proportional to the likelihood of the data point under a particular mixture component, multiplied by the weight of that component: $p_k(x_i|\theta_k).\alpha_k$
• $p_k$ is a density with parameters $\theta_k$. If Gaussian, then $p_k=\mathcal{N}$ and $\theta_k$ consists of the mean $\mu_k$ and covariance $\Sigma_k$.
• We divide by a normalizing constant to ensure all weights sum to 1:

\begin{equation} w_{ik} = \frac{p_k(x_i|\theta_k).\alpha_k}{\sum_{m=1}^{K}p_m(x_i|\theta_m).\alpha_m} \end{equation}
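
A vectorised sketch of this E-step in NumPy/SciPy (the function name is illustrative): each column of the weight matrix is the numerator above, and rows are then normalised to sum to 1.

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, alphas, mus, sigmas):
    """Compute membership weights w[i, k] for all points and components."""
    N, K = X.shape[0], len(alphas)
    w = np.zeros((N, K))
    for k in range(K):
        # Numerator: alpha_k * p_k(x_i | theta_k), for every point at once.
        w[:, k] = alphas[k] * multivariate_normal(mus[k], sigmas[k]).pdf(X)
    # Normalise each row so membership weights sum to 1.
    w /= w.sum(axis=1, keepdims=True)
    return w
```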

• M-step: Use the membership weights and the data to calculate new parameter values.
• Calculate cluster weights $\alpha_k^{new} = \frac{N_k}{N}, 1 \leq k \leq K$, where $N_k = \sum_{i=1}^{N} w_{ik}$ is the total membership weight of cluster $k$ over all data points, i.e. how much weight is apportioned to $k$ across the entire dataset.
• Calculate the mean $\mu_k$ and covariance $\Sigma_k$ of each mixture component:

\begin{equation} \mu_k^{new} = \frac{1}{N_k}\sum_{i=1}^{N}w_{ik}.x_i, \quad 1 \leq k \leq K \end{equation}

\begin{equation} \Sigma_k^{new} = \frac{1}{N_k}\sum_{i=1}^{N}w_{ik}.(x_i - \mu_k^{new})(x_i - \mu_k^{new})^t \end{equation}
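
A matching M-step sketch implementing the three updates above (names are illustrative; `X` is an $(N, d)$ data matrix and `w` the $(N, K)$ weight matrix from the E-step):

```python
import numpy as np

def m_step(X, w):
    """Re-estimate alphas, mus, sigmas from data X (N, d) and weights w (N, K)."""
    N, d = X.shape
    K = w.shape[1]
    Nk = w.sum(axis=0)                    # N_k: total weight assigned to each cluster
    alphas = Nk / N                       # new mixture weights
    mus = (w.T @ X) / Nk[:, None]         # weighted means, shape (K, d)
    sigmas = np.zeros((K, d, d))
    for k in range(K):
        diff = X - mus[k]                 # (N, d)
        # Weighted outer products: sum_i w_ik (x_i - mu_k)(x_i - mu_k)^t / N_k
        sigmas[k] = (w[:, k, None] * diff).T @ diff / Nk[k]
    return alphas, mus, sigmas
```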

• EM-Algorithm: Running parameter updates iteratively
• After computing the model parameters (M-step), we can recompute the membership weights (E-step), which we then use to recompute the model parameters (M-step), and so on.
• As we run the algorithm, we expect the log likelihood to increase.
• The log likelihood is the log of the probability of observing a given set of data under the current parameters of the model.
• The log likelihood is given by:

\begin{equation} \log p(X|\theta) = \sum_{i=1}^{N} \log p(x_i|\theta) \end{equation}

• In a Gaussian mixture model, where each $p_k$ is the Gaussian density (2), this becomes:

\begin{equation} \log p(X|\theta) = \sum_{i=1}^{N} \log \sum_{k=1}^{K} \alpha_k.p_k(x_i|\theta_k) \end{equation}
• Run EM until convergence: repeat until there is no significant change (increase) in the log likelihood between iterations, or the number of iterations exceeds a maximum.
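
Putting the pieces together: a minimal EM driver, assuming the `e_step` and `m_step` sketches above, that records the log likelihood at each iteration (useful for the progress plots mentioned earlier) and stops when the increase falls below a tolerance:

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_likelihood(X, alphas, mus, sigmas):
    """log p(X|theta) = sum_i log sum_k alpha_k p_k(x_i|theta_k)."""
    per_component = np.column_stack([
        alphas[k] * multivariate_normal(mus[k], sigmas[k]).pdf(X)
        for k in range(len(alphas))
    ])
    return np.log(per_component.sum(axis=1)).sum()

def fit_gmm(X, alphas, mus, sigmas, max_iters=100, tol=1e-6):
    """Run EM from the given initial parameters; returns params and LL history."""
    history = [log_likelihood(X, alphas, mus, sigmas)]
    for _ in range(max_iters):
        w = e_step(X, alphas, mus, sigmas)          # E-step
        alphas, mus, sigmas = m_step(X, w)          # M-step
        history.append(log_likelihood(X, alphas, mus, sigmas))
        if history[-1] - history[-2] < tol:         # no significant increase
            break
    return alphas, mus, sigmas, history
```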

#### Strengths of the Algorithm

• EM with GMM is an extension of k-means. Whereas k-means makes hard assignments of data points to cluster centres, in EM-GMM a data point has soft “membership weights”, i.e. uncertainty over which cluster it belongs to.

#### Relation to K-means

• K-means is a special case of EM-GMM, arising in the limit where the cluster covariances are spherical and shrink to zero. In that limit the membership of a data point depends solely on its distance to the nearest cluster centre, and the soft weights collapse to hard assignments.

#### Weaknesses of the Algorithm

• EM is prone to converging to a local optimum. In practice, run EM with multiple random initialisations of the cluster centres.
• We need to specify the number of clusters and initialise the cluster means.
• Cluster centres can be initialised with k-means or k-means++.
• EM is prone to overfitting in high dimensions.
• With high-dimensional data, the number of covariance parameters grows as $D^2 \times K$ for $D$ dimensions and $K$ clusters.
• To reduce the number of parameters, we can use diagonal covariance matrices, i.e. the off-diagonal elements are fixed to zero and only the per-dimension variances on the diagonal are estimated.
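
Both mitigations are available off the shelf, e.g. in scikit-learn's `GaussianMixture`, which supports multiple random restarts via `n_init` and diagonal covariances via `covariance_type='diag'` (the data below is made up for illustration):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.default_rng(0).normal(size=(500, 50))  # hypothetical high-dim data

gmm = GaussianMixture(
    n_components=3,
    covariance_type="diag",  # off-diagonal covariance entries fixed to zero
    n_init=5,                # keep the best of 5 restarts, mitigates local optima
    init_params="kmeans",    # initialise means with k-means
)
gmm.fit(X)
print(gmm.weights_, gmm.means_.shape)  # mixture weights alpha_k and (K, D) means
```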