Before we move on to Gaussian mixture models, let’s take a look at DBSCAN, another popular clustering algorithm that illustrates a very different approach based on local density estimation. This approach allows the algorithm to identify clusters of arbitrary shapes.
Understanding DBSCAN Clustering Algorithm
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a popular clustering algorithm in data science known for its simplicity and effectiveness. Unlike traditional clustering algorithms like K-means, DBSCAN defines clusters as continuous regions of high density, making it robust to outliers and capable of identifying clusters of any shape.
How DBSCAN Works
Here’s a simplified explanation of how DBSCAN works:
- Defining Neighborhoods: For each instance, DBSCAN counts how many instances are located within a small distance ε (epsilon) from it. This region is called the instance’s ε-neighborhood.
- Identifying Core Instances: If an instance has at least min_samples instances in its ε-neighborhood (including itself), it is considered a core instance. These core instances are located in dense regions.
- Cluster Formation: All instances in the neighborhood of a core instance belong to the same cluster. This may include other core instances, forming a single cluster.
- Anomaly Detection: Any instance that is not a core instance and does not have one in its neighborhood is considered an anomaly.
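The neighborhood-counting and core-instance steps above can be sketched from scratch. This is only a minimal illustration of the density test, not the full DBSCAN cluster-expansion logic, and `core_mask` is a hypothetical helper name:

```python
import numpy as np

def core_mask(X, eps=0.2, min_samples=5):
    """Mark core instances: points with at least min_samples
    neighbors (including themselves) within distance eps."""
    # Pairwise Euclidean distances (fine for small datasets)
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # Count neighbors within eps; the diagonal counts the point itself
    neighbor_counts = (dists <= eps).sum(axis=1)
    return neighbor_counts >= min_samples

# Tiny example: a dense blob of 6 points plus one isolated outlier
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.05, size=(6, 2)),  # dense region
               [[5.0, 5.0]]])                     # far-away point
mask = core_mask(X, eps=0.3, min_samples=4)
print(mask)  # blob points are core; the outlier is not
```

The blob points all fall inside each other's ε-neighborhoods, so they pass the `min_samples` test; the isolated point only counts itself, so it fails.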

Implementing DBSCAN in Python
You can easily implement DBSCAN in Python using Scikit-Learn. Here’s a sample code snippet:
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Generate sample data
X, y = make_moons(n_samples=1000, noise=0.05)

# Initialize and fit the DBSCAN model
dbscan = DBSCAN(eps=0.05, min_samples=5)
dbscan.fit(X)

# Access cluster labels and core sample indices
print(dbscan.labels_)
print(len(dbscan.core_sample_indices_))
print(dbscan.core_sample_indices_)
print(dbscan.components_)

Interpreting Results
- Cluster labels are available in dbscan.labels_; instances with a cluster index of -1 are considered anomalies.
- Core sample indices are accessible through dbscan.core_sample_indices_.
- The core instances themselves are available in dbscan.components_.
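Putting these attributes to work, the sketch below counts clusters and anomalies from the labels. Note that eps=0.2 here (rather than the 0.05 used above) is an assumption: a larger neighborhood that tends to recover the two moons cleanly on this dataset.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=1000, noise=0.05, random_state=42)
dbscan = DBSCAN(eps=0.2, min_samples=5)  # assumed eps; tune for your data
dbscan.fit(X)

labels = dbscan.labels_
# Label -1 means anomaly, so exclude it when counting clusters
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_anomalies = int((labels == -1).sum())
print(n_clusters, n_anomalies)
```

With the tiny eps=0.05 from the earlier snippet, DBSCAN instead fragments the moons into many small clusters and flags many points as anomalies, which is why eps is the key knob to tune.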
Other Clustering Algorithms
Scikit-Learn implements several more clustering algorithms that you should take a
look at. We cannot cover them all in detail here, but here is a brief overview:
- Agglomerative clustering: a hierarchy of clusters is built from the bottom up. Think of many tiny bubbles floating on water and gradually attaching to each other until there’s just one big group of bubbles. Similarly, at each iteration agglomerative clustering connects the nearest pair of clusters (starting with individual instances). If you draw a tree with a branch for every pair of clusters that merged, you get a binary tree of clusters, where the leaves are the individual instances. This approach can capture clusters of various shapes; it produces a flexible and informative cluster tree instead of forcing you to choose a particular cluster scale, and it can be used with any pairwise distance. It can scale nicely to large numbers of instances if you provide a connectivity matrix: a sparse m × m matrix that indicates which pairs of instances are neighbors (e.g., as returned by sklearn.neighbors.kneighbors_graph()). Without a connectivity matrix, the algorithm does not scale well to large datasets.
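The connectivity-matrix idea can be sketched as follows. Single linkage is an assumption here; it is the linkage that follows chains of neighbors, which suits the moons shape:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_moons
from sklearn.neighbors import kneighbors_graph

X, y = make_moons(n_samples=500, noise=0.05, random_state=42)

# Sparse connectivity matrix: each instance is linked to its 10 nearest
# neighbors, which keeps the merge search local and scalable
connectivity = kneighbors_graph(X, n_neighbors=10, include_self=False)

agg = AgglomerativeClustering(n_clusters=2, linkage="single",
                              connectivity=connectivity)
labels = agg.fit_predict(X)
print(np.bincount(labels))  # instances per cluster
```

Without the connectivity argument, the algorithm would consider all pairs of clusters at every merge, which is what makes it slow on large datasets.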
- Birch: this algorithm was designed specifically for very large datasets, and it can be faster than batch K-Means, with similar results, as long as the number of features is not too large (<20). During training, it builds a tree structure containing just enough information to quickly assign each new instance to a cluster, without having to store all the instances in the tree: this allows it to use limited memory while handling huge datasets.
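A minimal sketch of Birch on a larger dataset, with threshold and branching_factor left at their defaults (tuning them is left as an assumption for your data):

```python
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

# A fairly large, low-dimensional dataset: Birch's sweet spot
X, y = make_blobs(n_samples=10_000, centers=5, n_features=2,
                  random_state=42)

birch = Birch(n_clusters=5)
birch.fit(X)

# New instances are assigned quickly using the tree built during training
labels = birch.predict(X)
print(len(set(labels.tolist())))  # number of distinct clusters found
```

Because only summary statistics are stored in the tree, memory usage stays bounded even as the number of instances grows.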
- Mean-shift: this algorithm starts by placing a circle centered on each instance, then for each circle it computes the mean of all the instances located within it, and it shifts the circle so that it is centered on the mean. Next, it iterates this mean-shift step until all the circles stop moving (i.e., until each of them is centered on the mean of the instances it contains). This algorithm shifts the circles in the direction of higher density, until each of them has found a local density maximum. Finally, all the instances whose circles have settled in the same place (or close enough) are assigned to the same cluster. Mean-shift shares some of DBSCAN’s features: it can find any number of clusters of any shape, it has just one hyperparameter (the radius of the circles, called the bandwidth), and it relies on local density estimation. However, it tends to chop clusters into pieces when they have internal density variations. Unfortunately, its computational complexity is O(m²), so it is not suited for large datasets.
- Affinity propagation: this algorithm uses a voting system, where instances vote for similar instances to be their representatives; once the algorithm converges, each representative and its voters form a cluster. Affinity propagation can detect any number of clusters of different sizes. Unfortunately, its computational complexity is O(m²), so it is not suited for large datasets.
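The voting scheme can be sketched as below. The preference value is an assumption that controls how eager instances are to become representatives (lower values mean fewer clusters):

```python
from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=300, centers=[[1, 1], [-1, -1], [1, -1]],
                  cluster_std=0.5, random_state=0)

# No number of clusters required; preference biases how many emerge
ap = AffinityPropagation(preference=-50, random_state=0)
ap.fit(X)

labels = ap.labels_
exemplars = ap.cluster_centers_indices_  # the elected representatives
print(len(exemplars))  # number of clusters the votes converged to
```

Each exemplar index points at an actual instance in X, which is a nice property when you want a concrete, interpretable representative per cluster.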
- Spectral clustering: this algorithm takes a similarity matrix between the instances and creates a low-dimensional embedding from it (i.e., it reduces its dimensionality), then it uses another clustering algorithm in this low-dimensional space (Scikit-Learn’s implementation uses K-Means). Spectral clustering can capture complex cluster structures, and it can also be used to cut graphs (e.g., to identify clusters of friends on a social network); however, it does not scale well to large numbers of instances, and it does not behave well when the clusters have very different sizes.
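A minimal sketch on the two-moons data, using a nearest-neighbors similarity graph (the n_neighbors value is an assumption; it controls how local the similarity graph is):

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=500, noise=0.05, random_state=42)

# Build a sparse similarity graph from nearest neighbors, embed it in
# a low-dimensional space, then run K-Means in that space
sc = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                        n_neighbors=10, random_state=42)
labels = sc.fit_predict(X)
print(np.bincount(labels))  # instances per cluster
```

Like DBSCAN, this separates the two moons even though K-Means alone could not, because the clustering happens in the spectral embedding rather than in the original space.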
Conclusion
DBSCAN is a versatile clustering algorithm suitable for various applications. Its simplicity, robustness to outliers, and ability to identify clusters of any shape make it a valuable tool in data science. By implementing DBSCAN in Python using Scikit-Learn, you can efficiently analyze and cluster your data.