Rather than manually searching for the optimal number of clusters, it is possible to use instead the BayesianGaussianMixture class which is capable of giving weights equal (or close) to zero to unnecessary clusters. Just set the number of clusters n_components to a value that you have good reason to believe is greater than the optimal number of clusters (this assumes some minimal knowledge about the problem at hand), and the algorithm will eliminate the unnecessary clusters automatically. For example, let’s set the number of clusters to 10 and see what happens:
from sklearn.mixture import BayesianGaussianMixture
bgm = BayesianGaussianMixture(n_components=10, n_init=10, random_state=42)
bgm.fit(X)
np.round(bgm.weights_, 2)
array([0.4 , 0.21, 0.4 , 0. , 0. , 0. , 0. , 0. , 0. , 0. ])
Perfect: the algorithm automatically detected that only 3 clusters are needed, and the
resulting clusters are almost identical. In this model, the cluster parameters (including the weights, means and covariance matrices) are not treated as fixed model parameters anymore, but as latent random variables, like the cluster assignments . So z now includes both the cluster parameters and the cluster assignments

Prior knowledge about the latent variables z can be encoded in a probability distribution p(z) called the prior. For example, we may have a prior belief that the clusters are likely to be few (low concentration), or conversely, that they are more likely to be plentiful (high concentration). This can be adjusted using the weight_concentration_prior hyperparameter. Setting it to 0.01 or 1000 gives very different clusterings (see Figure 9-23). However, the more data we have, the less the priors matter. In fact, to plot diagrams with such large differences, you must use very strong priors and little data.

The fact that you see only 3 regions in the right plot although there are 4 centroids is not a bug: the weight of the top-right cluster is much larger than the weight of the lower-right cluster, so the probability that any given point in this region belongs to the top-right cluster is greater than the probability that it belongs to the lowerright cluster, even near the lower-right cluster.
Bayes’ theorem tells us how to update the probability distribution over the latent variables after we observe some data X. It computes the posterior distribution p(z|X), which is the conditional probability of z given X

Unfortunately, in a Gaussian mixture model (and many other problems), the denominator p(x) is intractable, as it requires integrating over all the possible values of z. This means considering all possible combinations of cluster parameters and cluster assignments.

This is one of the central problems in Bayesian statistics, and there are several approaches to solving it. One of them is variational inference, which picks a family of distributions q(z; λ) with its own variational parameters λ (lambda), then it optimizes these parameters to make q(z) a good approximation of p(z|X). This is achieved by finding the value of λ that minimizes the KL divergence from q(z) to p(z|X), noted DKL(q‖p). The KL divergence equation and it can be rewritten as the log of the evidence (log p(X)) minus the evidence lower bound (ELBO). Since the log of the evidence does not depend on q, it is a constant term, so minimizing the KL divergence just requires maximizing the ELBO.

In practice, there are different techniques to maximize the ELBO. In mean field variational inference, it is necessary to pick the family of distributions q(z; λ) and the prior p(z) very carefully to ensure that the equation for the ELBO simplifies to a form that can actually be computed. Unfortunately, there is no general way to do this, it depends on the task and requires some mathematical skills. For example, the distributions and lower bound equations used in Scikit-Learn’s BayesianGaussianMixture class are presented in the documentation. From these equations it is possible to derive update equations for the cluster parameters and assignment variables: these are then used very much like in the Expectation-Maximization algorithm. In fact, the computational complexity of the BayesianGaussianMixture class is similar to that of the GaussianMixture class (but generally significantly slower). A simpler approach to maximizing the ELBO is called black box stochastic variational inference (BBSVI): at each iteration, a few samples are drawn from q and they are used to estimate the gradients of the ELBO with regards to the variational parameters λ, which are then used in a gradient ascent step. This approach makes it possible to use Bayesian inference with any kind of model (provided it is differentiable), even deep neural networks: this is called Bayesian deep learning.