A Gaussian mixture model (GMM) is a probabilistic model that assumes that the instances were generated from a mixture of several Gaussian distributions whose parameters are unknown. All the instances generated from a single Gaussian distribution form a cluster that typically looks like an ellipsoid. Each cluster can have a different ellipsoidal shape, size, density and orientation, just like in the figure below. When you observe an instance, you know it was generated from one of the Gaussian distributions, but you are not told which one, and you do not know what the parameters of these distributions are. There are several GMM variants: in the simplest variant, implemented in the GaussianMixture class, you must know in advance the number k of Gaussian distributions. The dataset X is assumed to have been generated through the following probabilistic process:

  • For each instance, a cluster is picked randomly from among the k clusters. The probability of choosing the jth cluster is defined by the cluster’s weight ϕ^(j). The index of the cluster chosen for the ith instance is noted z^(i).
  • If z^(i) = j, meaning the ith instance has been assigned to the jth cluster, the location x^(i) of this instance is sampled randomly from the Gaussian distribution with mean μ^(j) and covariance matrix Σ^(j). This is noted x^(i) ~ 𝒩(μ^(j), Σ^(j)) (see the sampling sketch right after this list).
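To make the process concrete, here is a minimal sampling sketch in plain NumPy. The parameter values below are placeholders chosen for illustration, not the exact ones behind the dataset used later in this article:

import numpy as np

# Illustrative placeholder parameters for k = 3 clusters
phi = np.array([0.2, 0.4, 0.4])        # cluster weights ϕ^(1..3)
mus = np.array([[3.4, 1.1],
                [-1.4, 1.4],
                [0.0, 0.0]])            # cluster means μ^(1..3)
Sigmas = np.array([np.eye(2)] * 3)      # cluster covariance matrices Σ^(1..3)

rng = np.random.default_rng(42)
m = 1000
z = rng.choice(len(phi), size=m, p=phi)                 # step 1: pick a cluster for each instance
X = np.array([rng.multivariate_normal(mus[j], Sigmas[j])  # step 2: sample the instance's location
              for j in z])                                # from that cluster's Gaussian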

This generative process can be represented as a graphical model: a graph that represents the structure of the conditional dependencies between random variables.

Diagram illustrating a Gaussian mixture model (GMM) with labels, instances, weight parameters, means, and covariance matrices, along with plate notation indicating repetition.
Gaussian mixture model

Here is how to interpret it:

  • The circles represent random variables.
  • The squares represent fixed values (i.e., parameters of the model).
  • The large rectangles are called plates: they indicate that their content is repeated several times.
  • The number at the bottom right-hand side of each plate indicates how many times its content is repeated, so there are m random variables z^(i) (from z^(1) to z^(m)) and m random variables x^(i), k means μ^(j) and k covariance matrices Σ^(j), but just one weight vector ϕ (containing all the weights ϕ^(1) to ϕ^(k)).
  • Each variable z^(i) is drawn from the categorical distribution with weights ϕ. Each variable x^(i) is drawn from the normal distribution with the mean and covariance matrix defined by its cluster z^(i).
  • The solid arrows represent conditional dependencies. For example, the probability distribution for each random variable z^(i) depends on the weight vector ϕ. Note that when an arrow crosses a plate boundary, it means that it applies to all the repetitions of that plate, so for example the weight vector ϕ conditions the probability distributions of all the random variables z^(1) to z^(m) (the factorization shown right after this list makes these dependencies explicit).
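Putting these dependencies together, the joint distribution encoded by this graphical model factorizes in the standard GMM way (this equation is not spelled out in the figure; it simply restates the two-step generative process above):

p(x^(1), …, x^(m), z^(1), …, z^(m)) = ∏ᵢ p(z^(i)) · p(x^(i) | z^(i)) (product over i = 1 to m), where p(z^(i) = j) = ϕ^(j) and x^(i) | z^(i) = j ~ 𝒩(μ^(j), Σ^(j)).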

So what can you do with such a model? Well, given the dataset X, you typically want to start by estimating the weights ϕ and all the distribution parameters μ^(1) to μ^(k) and Σ^(1) to Σ^(k). Scikit-Learn’s GaussianMixture class makes this trivial:
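Here is a minimal sketch of that fit (assuming the 2D dataset X from this example is already loaded as a NumPy array; n_components=3 matches the three clusters in the data, and n_init=10 is the setting referred to below):

from sklearn.mixture import GaussianMixture

# Fit a mixture of k = 3 Gaussians, running 10 random initializations and keeping the best
# (X is assumed to be the 2D dataset used throughout this example)
gm = GaussianMixture(n_components=3, n_init=10)
gm.fit(X)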

Let’s look at the parameters that the algorithm estimated:

>>> gm.weights_
array([0.20965228, 0.4000662 , 0.39028152])
>>> gm.means_
array([[ 3.39909717,  1.05933727],
       [-1.40763984,  1.42710194],
       [ 0.05135313,  0.07524095]])
>>> gm.covariances_
array([[[ 1.14807234, -0.03270354],
        [-0.03270354,  0.95496237]],
       [[ 0.63478101,  0.72969804],
        [ 0.72969804,  1.1609872 ]],
       ...])

Great, it worked fine! Indeed, the weights that were used to generate the data were 0.2, 0.4 and 0.4, and similarly, the means and covariance matrices found by the algorithm were very close to the ones used to generate the data. But how? This class relies on the Expectation-Maximization (EM) algorithm, which has many similarities with the K-Means algorithm: it also initializes the cluster parameters randomly, then repeats two steps until convergence, first assigning instances to clusters (this is called the expectation step), then updating the clusters (this is called the maximization step). Sound familiar? Indeed, in the context of clustering you can think of EM as a generalization of K-Means which not only finds the cluster centers (μ^(1) to μ^(k)), but also their size, shape and orientation (Σ^(1) to Σ^(k)), as well as their relative weights (ϕ^(1) to ϕ^(k)). Unlike K-Means, though, EM uses soft cluster assignments rather than hard assignments: during the expectation step, for each instance, the algorithm estimates the probability that it belongs to each cluster (based on the current cluster parameters). Then, during the maximization step, each cluster is updated using all the instances in the dataset, with each instance weighted by the estimated probability that it belongs to that cluster. These probabilities are called the responsibilities of the clusters for the instances. During the maximization step, each cluster’s update will therefore mostly be impacted by the instances it is most responsible for.
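To make the two steps concrete, here is a minimal, illustrative sketch of a single EM iteration for a full-covariance GMM in plain NumPy/SciPy (this is not Scikit-Learn’s actual implementation; the function and variable names are assumptions):

import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, weights, means, covs):
    """One EM iteration for a full-covariance Gaussian mixture (illustrative sketch)."""
    m, k = len(X), len(weights)
    # Expectation step: responsibilities r[i, j] = P(instance i belongs to cluster j),
    # computed from the current parameters
    r = np.column_stack([weights[j] * multivariate_normal.pdf(X, means[j], covs[j])
                         for j in range(k)])
    r /= r.sum(axis=1, keepdims=True)
    # Maximization step: re-estimate each cluster from all instances,
    # each weighted by its responsibility
    nj = r.sum(axis=0)                          # "effective" number of instances per cluster
    new_weights = nj / m
    new_means = (r.T @ X) / nj[:, np.newaxis]
    new_covs = []
    for j in range(k):
        d = X - new_means[j]
        new_covs.append((r[:, j, np.newaxis] * d).T @ d / nj[j])
    return new_weights, new_means, new_covs

Repeating em_step until the parameters (or the log likelihood) stop changing is essentially what GaussianMixture does internally, with additional safeguards for numerical stability.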

Unfortunately, just like K-Means, EM can end up converging to poor solutions, so it needs to be run several times, keeping only the best solution. This is why we set n_init to 10. Be careful: by default n_init is only set to 1.

You can check whether or not the algorithm converged and how many iterations it took:

>>> gm.converged_
True
>>> gm.n_iter_
3

Okay, now that you have an estimate of the location, size, shape, orientation and relative weight of each cluster, the model can easily assign each instance to the most likely cluster (hard clustering) or estimate the probability that it belongs to a particular cluster (soft clustering). For this, just use the predict() method for hard clustering, or the predict_proba() method for soft clustering:

>>> gm.predict(X)
array([2, 2, 1, ..., 0, 0, 0])
>>> gm.predict_proba(X)
array([[2.32389467e-02, 6.77397850e-07, 9.76760376e-01],
       [1.64685609e-02, 6.75361303e-04, 9.82856078e-01],
       [2.01535333e-06, 9.99923053e-01, 7.49319577e-05],
       ...,
       [9.99999571e-01, 2.13946075e-26, 4.28788333e-07],
       [1.00000000e+00, 1.46454409e-41, 5.12459171e-16],
       [1.00000000e+00, 8.02006365e-41, 2.27626238e-15]])

It is a generative model, meaning you can actually sample new instances from it (note that they are ordered by cluster index):

>>> X_new, y_new = gm.sample(6)
>>> X_new
array([[ 2.95400315,  2.63680992],
       [-1.16654575,  1.62792705],
       [-1.39477712, -1.48511338],
       [ 0.27221525,  0.690366  ],
       [ 0.54095936,  0.48591934],
       [ 0.38064009, -0.56240465]])
>>> y_new
array([0, 1, 2, 2, 2, 2])

It is also possible to estimate the density of the model at any given location. This is achieved using the score_samples() method: for each instance it is given, this method estimates the log of the probability density function (PDF) at that location. The greater the score, the higher the density:

>>> gm.score_samples(X)
array([-2.60782346, -3.57106041, -3.33003479, ..., -3.51352783,
       -4.39802535, -3.80743859])

If you compute the exponential of these scores, you get the value of the PDF at the location of the given instances. These are not probabilities, but probability densities: they can take on any positive value, not just between 0 and 1. To estimate the probability that an instance will fall within a particular region, you would have to integrate the PDF over that region (if you do so over the entire space of possible instance locations, the result will be 1).
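As a quick sanity check, here is a small sketch (assuming the same fitted gm and 2D dataset X as above) that exponentiates the scores to get densities and numerically integrates the PDF over a grid covering the data; the result should come out close to 1:

import numpy as np

densities = np.exp(gm.score_samples(X))   # PDF values at the training instances (not probabilities)

# Approximate the integral of the PDF over a grid that covers (virtually) all of the density
xs = np.linspace(X[:, 0].min() - 5, X[:, 0].max() + 5, 400)
ys = np.linspace(X[:, 1].min() - 5, X[:, 1].max() + 5, 400)
grid = np.column_stack([g.ravel() for g in np.meshgrid(xs, ys)])
cell_area = (xs[1] - xs[0]) * (ys[1] - ys[0])
print(np.exp(gm.score_samples(grid)).sum() * cell_area)   # ≈ 1.0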
Figure below shows the cluster means, the decision boundaries (dashed lines), and the density contours of this model:

A contour plot showing two Gaussian clusters with density contours in blue and green, marked by X symbols, and dashed red lines indicating decision boundaries.
Cluster means, decision boundaries and density contours of a trained Gaussian mixture model

Nice! The algorithm clearly found an excellent solution. Of course, we made its task easy by actually generating the data from a set of 2D Gaussian distributions (unfortunately, real-life data is not always so Gaussian and low-dimensional), and we also gave the algorithm the correct number of clusters. When there are many dimensions, or many clusters, or few instances, EM can struggle to converge to the optimal solution. You might need to reduce the difficulty of the task by limiting the number of parameters that the algorithm has to learn: one way to do this is to limit the range of shapes and orientations that the clusters can have. This can be achieved by imposing constraints on the covariance matrices. To do so, set the covariance_type hyperparameter to one of the following values:

  • “spherical”: all clusters must be spherical, but they can have different diameters (i.e., different variances).
  • “diag”: clusters can take on any ellipsoidal shape of any size, but the ellipsoid’s axes must be parallel to the coordinate axes (i.e., the covariance matrices must be diagonal).
  • “tied”: all clusters must have the same ellipsoidal shape, size and orientation (i.e., all clusters share the same covariance matrix).

By default, covariance_type is equal to “full”, which means that each cluster can take on any shape, size and orientation (it has its own unconstrained covariance matrix). Figure below plots the solutions found by the EM algorithm when covariance_type is set to “tied” or “spherical”.

Comparison of Gaussian Mixture Model cluster boundaries with different covariance types: 'tied' on the left and 'spherical' on the right, depicting clusters with contours and decision boundaries.
Solutions found with covariance_type set to “tied” (left) and “spherical” (right)
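Here is a minimal sketch of how those two constrained models might be fit (same assumed dataset X and settings as before; only covariance_type changes):

from sklearn.mixture import GaussianMixture

# Same data and number of clusters as before, but with constrained covariance matrices
gm_tied = GaussianMixture(n_components=3, n_init=10, covariance_type="tied").fit(X)
gm_spherical = GaussianMixture(n_components=3, n_init=10, covariance_type="spherical").fit(X)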

The computational complexity of training a GaussianMixture model depends on the number of instances m, the number of dimensions n, the number of clusters k, and the constraints on the covariance matrices. If covariance_type is “spherical” or “diag”, it is O(kmn), assuming the data has a clustering structure. If covariance_type is “tied” or “full”, it is O(kmn² + kn³), so it will not scale to large numbers of features.
