K-Means Clustering
- We learned about the K-Means Clustering Algorithm, which represents each cluster by its centroid
- Step 1: Initialize the centroids at random
- Step 2: Assign each observation to the cluster of the nearest centroid
- Step 3: Recalculate the centroids based on the current cluster assignments
- Step 4: Repeat Steps 2 and 3 until the cluster assignments stop changing
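The loop above maps directly to code. Below is a minimal NumPy sketch of these steps; the function name, random initialization scheme, and convergence check are illustrative, not taken from the lecture:

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal K-Means sketch: X has shape (n_samples, n_features)."""
    rng = np.random.default_rng(seed)
    # Step 1: initialize centroids by picking k distinct observations at random
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # Step 2: assign each observation to the cluster of the nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 4: stop once the cluster assignments no longer change
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 3: recompute each centroid as the mean of its assigned points
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster becomes empty
                centroids[j] = members.mean(axis=0)
    return labels, centroids
```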

K-Means: Limitations
K-Means is an iterative approach that requires us to specify the number of clusters K in advance.
- Determining the optimal number of clusters can be challenging, especially when the data's underlying structure is not well understood.
- The final result can be sensitive to the initial placement of the cluster centroids and to the choice of distance metric.
- Assumes that clusters are spherical and isotropic, i.e., of similar size and density in every direction.
- Sensitive to outliers, which can disproportionately influence the positions of the centroids and lead to suboptimal clustering results.
Hierarchical Clustering
- An alternative approach that does not require a pre-specified choice of K and provides a deterministic answer
- It is based on distances between the data points, in contrast to K-Means, which focuses on distances from the centroids
- Agglomerative / Bottom-Up Clustering: we recursively merge the most similar clusters
- Divisive / Top-Down Clustering: we recursively subdivide clusters into dissimilar subclusters
Agglomerative Clustering
- Start with each point in its own cluster
- Identify the two closest clusters (Ci, Cj) and merge them
- Repeat until all points are in a single cluster
Equivalently, in terms of a growing distance threshold:
- Set the threshold t to 0
- Increase t = t + e, where e is a small positive quantity
- If the distance between any pair of clusters is < t, combine them
- Update all distances from/to the newly formed cluster
To visualize the results, we can look at the corresponding dendrogram.
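For instance, here is a short sketch using SciPy's hierarchical clustering routines; the toy 2-D data and the chosen linkage are purely illustrative:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Toy 2-D data: two well-separated blobs (illustrative only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)),
               rng.normal(4, 0.5, (20, 2))])

# Agglomerative clustering: repeatedly merge the two closest clusters
Z = linkage(X, method="single")   # 'single', 'complete', or 'average' linkage

# Visualize the merge sequence as a dendrogram
dendrogram(Z)
plt.show()
```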


Agglomerative Clustering
- Agglomerative clustering is based on measuring the similarity between clusters.
- How do we measure the distance between a cluster and a point?
- How do we measure the distance between two clusters?
- The choice of how to measure distances between clusters is called the linkage.
- Linkage is a dissimilarity measure d(C1, C2) between two clusters C1 and C2, telling us how different the points in these sets are.
- Different Types:
- Single Linkage: Smallest of the distances between pairs of samples
- Complete Linkage: Largest of the distances between pairs of samples
- Average Linkage: Average of the distances between pairs of samples
Single Linkage
Single linkage, also known as nearest-neighbour linkage, measures the dissimilarity between C1 and C2 by the smallest dissimilarity between a point in C1 and a point in C2.
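Writing d(x, y) for the dissimilarity between two points, this can be stated as:

$$ d_{\text{single}}(C_1, C_2) = \min_{x \in C_1,\; y \in C_2} d(x, y) $$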

Pros: Can accurately handle non-elliptical shapes.
Cons:
- Sensitive to noise and outliers.
- May yield long, extended clusters in which points within the same cluster can be quite dissimilar [Chaining]


Complete Linkage
Complete linkage, also known as furthest-neighbour linkage, measures the dissimilarity between C1 and C2 by the largest dissimilarity between a point in C1 and a point in C2.
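In the same notation:

$$ d_{\text{complete}}(C_1, C_2) = \max_{x \in C_1,\; y \in C_2} d(x, y) $$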


Pros: Less sensitive to noise and outliers.
Cons: Points in one cluster may end up closer to points in another cluster than to points in their own cluster [Crowding]
Average Linkage
In average linkage, the dissimilarity between C1 and C2 is the average dissimilarity over all pairs of points, one from C1 and one from C2.
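In the same notation:

$$ d_{\text{average}}(C_1, C_2) = \frac{1}{|C_1|\,|C_2|} \sum_{x \in C_1} \sum_{y \in C_2} d(x, y) $$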


- Tries to strike a balance between single and complete linkage.
- Clusters tend to be relatively compact and relatively far apart.
Single-Link vs. Complete-Link Clustering

Example of Chaining and Crowding

Single linkage produces long, chain-like clusters, while complete linkage strives to find compact clusters.
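One way to see this contrast is to run both linkages on a non-elliptical dataset such as scikit-learn's two-moons data; the dataset and parameters below are chosen only for illustration:

```python
from sklearn.datasets import make_moons
from sklearn.cluster import AgglomerativeClustering

# Two interleaved, non-elliptical clusters
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# Single linkage follows the chain of nearby points and tends to recover the two moons
single = AgglomerativeClustering(n_clusters=2, linkage="single").fit_predict(X)

# Complete linkage favours compact, roughly spherical groups and tends to cut across the moons
complete = AgglomerativeClustering(n_clusters=2, linkage="complete").fit_predict(X)
```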
Comments on Average Linkage
Average linkage is not invariant to monotone increasing transformations of the dissimilarity matrix d, unlike single and complete linkage, which depend only on the ordering of the dissimilarities.

Dendrograms: Visualizing the Clustering
- A 2-D diagram with all samples arranged along the x-axis.
- The y-axis represents the distance threshold t.
- Each sample/cluster is drawn as a vertical line up to the threshold t at which it joins another cluster.

Dendrograms
- Dendrograms give a visual representation of the whole hierarchical clustering process.
- They allow the user to make the final decision on the number of clusters.
- Applicable irrespective of the dimensionality of the original samples.
- One can create multiple clusterings based on the choice of threshold.
- Can be used with any distance metric and cluster distance measurement strategy (single, complete, average).
- Provide an effective and intuitive cluster validity measure.

Choosing K from the Dendrogram
- Cutting the dendrogram horizontally at a chosen distance threshold yields a flat clustering; the number of vertical branches the cut crosses is the number of clusters K.
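Assuming a SciPy linkage matrix Z as in the earlier sketch, the cut can be made either at a distance threshold or directly for a desired number of clusters K:

```python
from scipy.cluster.hierarchy import fcluster

# Cut at a distance threshold t: every merge above t is undone,
# and the remaining branches define the flat clusters
labels_by_threshold = fcluster(Z, t=1.0, criterion="distance")

# Or request a fixed number of clusters K directly
labels_by_k = fcluster(Z, t=3, criterion="maxclust")
```
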
Outlier Detection With Dendrogram
A single isolated branch that joins the rest of the tree only at a large distance is suggestive of a data point that is very different from all the others.

Lifetime and Cluster Validity
- Lifetime: the time (T2 - T1) from the birth to the death of a specific set of clusters
- Equivalently, the interval of the distance threshold during which a specific clustering prevails
- A larger lifetime indicates that the clustering is stable over a larger range of the distance threshold, and hence more valid.
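As a sketch, and under the assumption that the lifetime of the K-cluster solution equals the gap between consecutive merge heights in a SciPy linkage matrix Z, the lifetimes can be computed as follows:

```python
import numpy as np

def cluster_lifetimes(Z):
    """Lifetime of each K-cluster solution: the range of thresholds over which it persists."""
    heights = Z[:, 2]        # merge distances, in increasing order
    n = Z.shape[0] + 1       # number of original samples
    lifetimes = {}
    for k in range(2, n):
        # the K-cluster partition exists between the (n-k)-th and (n-k+1)-th merges
        lifetimes[k] = heights[n - k] - heights[n - k - 1]
    return lifetimes

# Example usage:
# lifetimes = cluster_lifetimes(Z)
# best_k = max(lifetimes, key=lifetimes.get)   # K with the largest lifetime
```
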
Divisive Clustering
Divisive Clustering groups data samples in a top-down fashion

Divisive Clustering
- Start with all the data in one cluster
- Step 1: Choose the cluster Ck with the largest dissimilarity among its data points
- Step 2: Divide the selected cluster into two subclusters using a splitting criterion, such as partitioning along the dimension that maximizes the inter-cluster dissimilarity, or running a clustering algorithm like K-Means (with K = 2) on the cluster
- Step 3: Update the cluster structure by replacing the original cluster with the two newly formed subclusters
- Repeat Steps 1-3 until a stopping criterion is met (e.g., all clusters are singletons)
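A minimal sketch of this procedure, assuming K-Means with K = 2 as the splitting criterion and the average pairwise distance as the within-cluster dissimilarity (the function names are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

def avg_pairwise_distance(points):
    """Average pairwise Euclidean distance within a cluster (its within-cluster dissimilarity)."""
    m = len(points)
    if m < 2:
        return 0.0
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    return dists.sum() / (m * (m - 1))   # mean over ordered pairs, excluding self-distances

def divisive_clustering(X, n_clusters):
    """Top-down clustering: repeatedly split the most dissimilar cluster with 2-means."""
    clusters = [np.arange(len(X))]                 # start with all data in one cluster
    while len(clusters) < n_clusters:
        # Step 1: choose the cluster with the largest within-cluster dissimilarity
        k = max(range(len(clusters)),
                key=lambda i: avg_pairwise_distance(X[clusters[i]]))
        idx = clusters.pop(k)
        # Step 2: split it into two subclusters with K-Means (K = 2)
        split = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[idx])
        # Step 3: replace the original cluster with the two new subclusters
        clusters.append(idx[split == 0])
        clusters.append(idx[split == 1])
    labels = np.empty(len(X), dtype=int)
    for label, idx in enumerate(clusters):
        labels[idx] = label
    return labels
```
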
Agglomerative vs. Divisive
Agglomerative Clustering
- Bottom-Up Approach
- Computationally Efficient
- Robustness to initialization
- Ease of Interpretation
- Chaining Effect
- Sensitivity to distance metric
Divisive Clustering
- Top-down Approach
- Computationally Expensive
- Sensitivity to Initialization
- Potential for global optimum
- Less sensitive to local structure
- Difficulty in Handling Noise
Hierarchical Clustering: Notes
- Generalizes the clustering process to a hierarchy of cluster labels, which reflects the inherent hierarchical structure of many real-world datasets
- We looked into a few basic, yet effective ones
- Improvements: BIRCH (scalable), ROCK (categorical), CHAMELEON
- Cluster validity allows one to select a specific slice of the hierarchy of clusters
- The approaches need a distance/similarity function between a pair of samples (need not be a simple metric like Euclidean)
- Sensitive to outliers and distance metrics.
