Clustering Methodologies: k-Means, Hierarchical, and Density-Based Approaches (DBSCAN)

Clustering is a fundamental unsupervised learning technique used to group similar data points together. In this lesson, we will dive into three popular clustering methodologies: k-Means, Hierarchical Clustering, and Density-Based Spatial Clustering of Applications with Noise (DBSCAN).

What is Clustering?

Clustering involves partitioning a dataset into groups (or clusters) where data within each group are more similar to each other than to those in other groups. This is widely used in customer segmentation, image analysis, anomaly detection, and more.
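As a rough illustration of what "more similar to each other" means, the sketch below compares the average distance between points in the same group with the average distance between points in different groups. The grouping in `labels` is a hypothetical hand-made assignment, and Euclidean distance is an assumed similarity measure; other measures may suit other data.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Six 2-D points and a hypothetical, hand-made grouping into two clusters
points = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])
labels = np.array([0, 0, 0, 1, 1, 1])

# Pairwise Euclidean distances between all points
dist = cdist(points, points)

# Boolean masks: pairs in the same cluster (excluding a point with itself)
same_cluster = labels[:, None] == labels[None, :]
not_self = ~np.eye(len(points), dtype=bool)

within = dist[same_cluster & not_self].mean()   # average distance inside clusters
between = dist[~same_cluster].mean()            # average distance across clusters

print(f"mean within-cluster distance:  {within:.3f}")
print(f"mean between-cluster distance: {between:.3f}")
```

A grouping is a reasonable clustering under this measure when the within-cluster average is clearly smaller than the between-cluster average.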

Types of Clustering Algorithms

k-Means Clustering

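k-Means is one of the simplest and most widely used clustering algorithms. It partitions the data into k clusters by assigning each point to its nearest centroid and minimizing the within-cluster sum of squared distances to those centroids (the within-cluster variance). The number of clusters k must be chosen in advance.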

from sklearn.cluster import KMeans
import numpy as np

# Sample data
X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])

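# Fit a k-Means model with 2 clusters.
# n_init is set explicitly because its default differs across scikit-learn versions.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster index assigned to each row of X
print(kmeans.cluster_centers_)  # coordinates of the two centroids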

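In this example, k-Means should separate the points with x = 1 from those with x = 4. The cluster indices themselves (0 or 1) are arbitrary and carry no meaning beyond group membership.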
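To make the idea of minimizing within-cluster variance more concrete, here is a minimal NumPy sketch of the usual iterative procedure (often called Lloyd's algorithm): assign each point to its nearest centroid, then move each centroid to the mean of its points. The names `lloyd_kmeans` and `nearest_centroid` and their parameters are hypothetical, and this is a simplified illustration rather than how scikit-learn's KMeans is implemented.

```python
import numpy as np

def nearest_centroid(points, centroids):
    """Return, for each point, the index of the closest centroid."""
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def lloyd_kmeans(points, k, n_iter=10, seed=0):
    """Simplified k-Means: random initial centroids, fixed number of iterations."""
    rng = np.random.default_rng(seed)
    start = rng.choice(len(points), size=k, replace=False)
    centroids = points[start].astype(float)

    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest centroid
        labels = nearest_centroid(points, centroids)
        # Update step: move each centroid to the mean of its assigned points
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)

    labels = nearest_centroid(points, centroids)
    # Inertia: sum of squared distances from each point to its centroid
    inertia = float(((points - centroids[labels]) ** 2).sum())
    return labels, centroids, inertia

points = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])
labels, centroids, inertia = lloyd_kmeans(points, k=2)
print(labels, inertia)
```

The result depends on the initial centroids, so different seeds can end in different local optima. This is one reason library implementations typically run several initializations and keep the best one.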

Hierarchical Clustering

Hierarchical clustering builds a hierarchy of clusters either agglomeratively (bottom-up) or divisively (top-down). The result can be visualized using a dendrogram.

from scipy.cluster.hierarchy import dendrogram, linkage
import matplotlib.pyplot as plt

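# Perform agglomerative clustering on X (the sample array from the k-Means example).
# Single linkage merges the two clusters whose closest points are nearest.
linked = linkage(X, method='single')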

# Plot dendrogram
dendrogram(linked)
plt.show()

This approach is useful when the number of clusters is unknown beforehand, because the dendrogram can be cut at any height to produce as many or as few clusters as needed.
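A dendrogram is a picture, but flat cluster labels are often needed as well. The sketch below, which assumes the same six sample points and single linkage, uses SciPy's `fcluster` to cut the hierarchy in two ways: by requesting a maximum number of clusters, and by choosing a distance threshold. The threshold of 2.5 is an illustrative value picked for this tiny dataset.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

points = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])

# Build the hierarchy; column 2 of the result holds the merge distances
Z = linkage(points, method="single")
print(Z[:, 2])

# Cut so that at most 2 flat clusters remain
labels_by_count = fcluster(Z, t=2, criterion="maxclust")

# Cut at a chosen height: merges above distance 2.5 are undone
labels_by_height = fcluster(Z, t=2.5, criterion="distance")

print(labels_by_count)
print(labels_by_height)
```

Both cuts should give the same partition here, since the only merge above 2.5 is the final one that joins the two groups.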

Density-Based Clustering (DBSCAN)

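DBSCAN groups points that lie in dense regions and labels points in sparse regions as noise, which makes it robust to outliers and capable of finding arbitrarily shaped clusters. Its two main parameters are eps, the neighborhood radius, and min_samples, the number of points (including the point itself) required within that radius for a point to count as a core point.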

from sklearn.cluster import DBSCAN

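# Apply DBSCAN to X, the sample array from the k-Means example.
# Neighboring points in X are 2 units apart, so eps must be at least 2 to link them
# (a smaller radius such as 0.5 would leave every point isolated and labeled as noise).
dbscan = DBSCAN(eps=2.5, min_samples=2).fit(X)
print(dbscan.labels_)

Points labeled as -1 are considered noise. With these settings every point in X joins one of two clusters, so no -1 labels appear.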
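The effect of eps is easiest to see by trying a few values. The sketch below reuses the six sample points, adds one far-away point at (10, 10) as an assumed outlier, and counts clusters and noise points for three radii. The helper name `summarize` and the chosen eps values are illustrative only.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Six sample points plus one hypothetical outlier at (10, 10)
points = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0], [10, 10]])

def summarize(eps, min_samples=2):
    """Return (number of clusters, number of noise points) for a given eps."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit(points).labels_
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = int(np.sum(labels == -1))
    return n_clusters, n_noise

results = {eps: summarize(eps) for eps in (0.5, 2.5, 3.5)}
for eps, (n_clusters, n_noise) in results.items():
    print(f"eps={eps}: {n_clusters} cluster(s), {n_noise} noise point(s)")
```

With a radius that is too small every point is noise, with a moderate radius the two groups are found and only the outlier is noise, and with a larger radius the two groups merge into one cluster.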

Choosing the Right Algorithm

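The choice of clustering algorithm depends on the data and the problem at hand. k-Means is a simple default when the number of clusters is known in advance. Hierarchical clustering suits cases where the number of clusters is unknown and a dendrogram can help decide. DBSCAN is a good fit when the data contain noise or the clusters have arbitrary shapes.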

Experiment with these methods to find the best fit for your specific dataset!
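One way to run such an experiment is to apply two methods to the same data and compare each result with known labels. The sketch below uses scikit-learn's `make_moons` to generate two interleaved crescents (noise-free here, to keep the outcome deterministic) and scores k-Means and DBSCAN with the adjusted Rand index, where 1.0 means a perfect match. The dataset and the parameter values (`eps=0.2`, `min_samples=5`) are assumptions chosen for this illustration, not general recommendations.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved half-circles: clusters that are not compact blobs
moons, true_labels = make_moons(n_samples=200, random_state=0)

# k-Means separates the data with a straight boundary between two centroids
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(moons)

# DBSCAN follows chains of densely packed points along each crescent
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(moons)

ari_kmeans = adjusted_rand_score(true_labels, kmeans_labels)
ari_dbscan = adjusted_rand_score(true_labels, dbscan_labels)

print(f"k-Means ARI: {ari_kmeans:.2f}")
print(f"DBSCAN ARI:  {ari_dbscan:.2f}")
```

On this shape DBSCAN should recover both crescents while k-Means cuts across them. On compact, well-separated blobs the ranking can easily be the other way around, which is why trying several methods is worthwhile.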