Hierarchical Clustering Explained
Hierarchical clustering is an unsupervised machine learning algorithm used to group similar data points into clusters. Unlike K-Means, which requires the number of clusters to be defined beforehand, hierarchical clustering builds a tree-like structure of clusters, allowing you to choose the number of clusters after inspecting the results. This structure is known as a dendrogram.
Key Characteristics
Unsupervised: It does not require labeled data.
Tree-based: Builds a hierarchy of clusters.
Flexible: You don’t need to predefine the number of clusters.
Distance-based: Uses linkage methods and distance metrics to decide cluster similarity.
How Hierarchical Clustering Works
Hierarchical clustering works by either combining smaller clusters into larger ones (agglomerative) or dividing a large cluster into smaller ones (divisive). The most common method is agglomerative clustering.
Step 1: Start with Individual Clusters
Initially, treat each data point as its own cluster.
Step 2: Compute Distances Between Clusters
Calculate how close each cluster is to every other cluster. This is typically done using Euclidean distance and one of the following linkage methods:
- Single linkage (minimum distance)
- Complete linkage (maximum distance)
- Average linkage (average distance)
- Ward's method (minimizes total within-cluster variance)
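To see how the choice of linkage method changes the result, here is a minimal sketch using SciPy's `linkage` function on a small toy dataset (the point values are arbitrary, chosen only to form two well-separated groups):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Toy data: two tight groups of three 2-D points each (illustrative values)
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])

results = {}
for method in ["single", "complete", "average", "ward"]:
    Z = linkage(X, method=method)
    # Each row of Z records one merge: [cluster_a, cluster_b, distance, size];
    # the last row is the final merge joining the two groups.
    results[method] = Z[-1, 2]
    print(f"{method}: final merge distance = {Z[-1, 2]:.3f}")
```

Comparing the last merge distance across methods shows how differently each criterion measures the gap between the same two groups: single linkage reports the smallest value, complete linkage the largest, with average linkage in between.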
Step 3: Merge the Closest Clusters
Identify the two closest clusters based on the chosen linkage method and merge them.
Step 4: Repeat Until One Cluster Remains
Repeat the merging process until all data points are grouped into one big cluster.
Step 5: Choose the Number of Final Clusters
Use the dendrogram to visually select a cutoff height. Cutting the dendrogram at this height gives you the desired number of clusters.
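As a concrete sketch of this step, SciPy's `fcluster` can cut a linkage tree either into a fixed number of clusters or at a chosen distance threshold (the toy data and the threshold value below are illustrative assumptions):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Two well-separated toy groups (illustrative values)
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]])
Z = linkage(X, method="ward")

# Cut the tree to obtain exactly 2 clusters...
labels_k = fcluster(Z, t=2, criterion="maxclust")

# ...or cut at a chosen distance threshold instead
labels_h = fcluster(Z, t=5.0, criterion="distance")

print(labels_k)
print(labels_h)
```

With clearly separated groups, both cuts assign the first three points to one cluster and the last three to another.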
Visualizing the Results: The Dendrogram
A dendrogram is a tree-like diagram that shows how clusters were formed at each step. The vertical lines represent clusters being merged. The longer the vertical line, the greater the distance (or dissimilarity) between merged clusters. To decide how many clusters to keep, you cut the dendrogram at a chosen height.
Summary Table
| Step | Action | Purpose |
|---|---|---|
| 1 | Treat each point as a cluster | Initial step |
| 2 | Compute distances between clusters | Determine similarity |
| 3 | Merge closest clusters | Build cluster hierarchy |
| 4 | Repeat until one cluster | Complete the tree |
| 5 | Cut dendrogram | Choose the final number of clusters |
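The steps in the table above can be sketched as a naive, didactic implementation of agglomerative clustering with single linkage. This is O(n³) and for illustration only; real code should use SciPy or scikit-learn:

```python
import numpy as np

def naive_agglomerative(X):
    """Didactic sketch of steps 1-4: single-linkage agglomeration."""
    # Step 1: every point starts as its own cluster
    clusters = [[i] for i in range(len(X))]
    merges = []
    while len(clusters) > 1:
        # Step 2: cluster distance = minimum pairwise distance (single linkage)
        best = (None, None, np.inf)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(np.linalg.norm(X[i] - X[j])
                        for i in clusters[a] for j in clusters[b])
                if d < best[2]:
                    best = (a, b, d)
        a, b, d = best
        # Step 3: merge the two closest clusters
        merges.append((clusters[a], clusters[b], d))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
        # Step 4: the loop repeats until one cluster remains
    return merges

X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
merges = naive_agglomerative(X)
for left, right, dist in merges:
    print(left, "+", right, "at distance", round(dist, 3))
```

The returned merge list is essentially what SciPy's linkage matrix encodes, and it is exactly the information a dendrogram draws.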
Mathematical Foundation of Hierarchical Clustering
Hierarchical clustering relies on the computation of distances between data points and the use of a linkage criterion to merge clusters. The algorithm uses a distance matrix and linkage functions to determine which clusters to merge at each step.
Distance Matrix
At the core of hierarchical clustering is a distance matrix \( D \), where each element \( D(i, j) \) represents the distance between data point \( i \) and data point \( j \). For Euclidean distance, it is computed as:

\[
D(i, j) = \sqrt{\sum_{k=1}^{n} (x_{ik} - x_{jk})^2}
\]

Here, \( x_{ik} \) and \( x_{jk} \) are the \( k \)-th features of data points \( i \) and \( j \), respectively.
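As a small illustration, the full distance matrix can be computed with SciPy's `pdist` and `squareform` (the three points below are chosen so the distances come out to round numbers):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])

# Full n x n Euclidean distance matrix D, with D[i, j] = ||x_i - x_j||
D = squareform(pdist(X, metric="euclidean"))
print(D)
# D[0, 1] is sqrt((3 - 0)^2 + (4 - 0)^2) = 5.0
```

In practice `linkage` accepts the raw observation matrix directly and computes these distances internally, so you rarely need to build `D` yourself.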
Linkage Criteria
Once distances are computed, hierarchical clustering uses a linkage criterion to determine the distance between clusters. Common linkage methods include:
- Single Linkage: minimum distance between any two points in different clusters
\[
D(C_i, C_j) = \min_{x \in C_i,\, y \in C_j} \|x - y\|
\]
- Complete Linkage: maximum distance between any two points in different clusters
\[
D(C_i, C_j) = \max_{x \in C_i,\, y \in C_j} \|x - y\|
\]
- Average Linkage: average of all pairwise distances between points in different clusters
\[
D(C_i, C_j) = \frac{1}{|C_i||C_j|} \sum_{x \in C_i} \sum_{y \in C_j} \|x - y\|
\]
- Ward's Method: the increase in total within-cluster variance caused by merging the two clusters; the pair with the smallest increase is merged:
\[
D(C_i, C_j) = \frac{|C_i||C_j|}{|C_i| + |C_j|} \|\mu_i - \mu_j\|^2
\]
where \( \mu_i \) and \( \mu_j \) are the centroids of clusters \( C_i \) and \( C_j \).
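The four formulas above can be checked numerically on two tiny clusters (the point coordinates are arbitrary illustrative values):

```python
import numpy as np

# Two small clusters of two 2-D points each (illustrative values)
Ci = np.array([[0.0, 0.0], [0.0, 1.0]])
Cj = np.array([[4.0, 0.0], [4.0, 1.0]])

# All |Ci| x |Cj| cross-cluster pairwise distances
pairwise = np.array([[np.linalg.norm(x - y) for y in Cj] for x in Ci])

single   = pairwise.min()    # min over all cross-cluster pairs
complete = pairwise.max()    # max over all cross-cluster pairs
average  = pairwise.mean()   # mean of all |Ci|*|Cj| pairs

# Ward: weighted squared distance between the cluster centroids
mu_i, mu_j = Ci.mean(axis=0), Cj.mean(axis=0)
ni, nj = len(Ci), len(Cj)
ward = ni * nj / (ni + nj) * np.sum((mu_i - mu_j) ** 2)

print(single, complete, average, ward)
```

Single linkage always yields the smallest inter-cluster distance and complete linkage the largest, with average linkage in between, which matches the min/max/mean structure of the formulas.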
Major Advantages of Hierarchical Clustering
- No need to specify the number of clusters in advance
- Produces a dendrogram for visual and detailed cluster analysis
- Effective for small to medium-sized datasets
- Supports multiple linkage methods (single, complete, average, Ward's) to fit different data structures
Major Challenges of Hierarchical Clustering
- Not scalable to large datasets due to high time and space complexity
- Sensitive to outliers and noise, which can distort the clustering structure
- Final results depend on the chosen linkage method
- Irreversible merges: once two clusters are merged, they cannot be split again
Common Use Cases
- Gene expression analysis – Identifying gene groups with similar activity patterns
- Customer segmentation – Grouping users by demographics or purchasing behavior
- Document clustering – Organizing large sets of text into related topics or themes
- Social network analysis – Detecting communities or clusters of users in networks
- Anomaly detection – Finding rare or unexpected groupings in datasets
The following Python example applies agglomerative hierarchical clustering with Ward's method to the Iris dataset and visualizes the result as a dendrogram.
```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from scipy.cluster.hierarchy import dendrogram, linkage

# Load real-world data: Iris dataset
iris = load_iris()
X = iris.data
feature_names = iris.feature_names

# Standardize features (important for distance-based algorithms)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Perform hierarchical clustering using Ward's method
linked = linkage(X_scaled, method='ward')

# Mobile-friendly dendrogram visualization
plt.figure(figsize=(6, 4))  # smaller figure for mobile view
dendrogram(
    linked,
    truncate_mode='lastp',  # show only the last p merged clusters
    p=30,
    leaf_rotation=90.,
    leaf_font_size=8.,
)
plt.title("Hierarchical Clustering Dendrogram (Iris)", fontsize=10)
plt.xlabel("Cluster Index", fontsize=9)
plt.ylabel("Distance", fontsize=9)
plt.xticks(fontsize=8)
plt.yticks(fontsize=8)
plt.tight_layout()
plt.grid(True)
plt.show()
```
Explanation of the Code
- Dataset Selection: The Iris dataset is loaded from scikit-learn. It's a popular real-world dataset with 150 samples of iris flowers described by four features. Since hierarchical clustering is unsupervised, we use only the feature values and ignore the class labels.
- Feature Scaling: Before clustering, the feature values are standardized. This ensures that all features contribute equally to the distance calculations, which is critical for distance-based algorithms like hierarchical clustering.
- Distance Calculation and Linkage Matrix: Hierarchical clustering starts by calculating the distances between all pairs of points. Ward's linkage then aims to minimize the within-cluster variance at each merge. This step produces a linkage matrix, which records which clusters are merged at each step and at what distance.
- Building the Hierarchy: Initially, every data point is its own cluster. The algorithm finds the two closest clusters (based on Ward's linkage) and merges them. This process repeats, building the hierarchy bottom-up until all points belong to one cluster.
- Dendrogram Visualization: A dendrogram visualizes how clusters are formed at each stage. Each branch of the tree represents a merge. By analyzing the height of these branches, one can decide where to "cut" the tree to form the desired number of clusters.
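To turn the dendrogram into concrete cluster assignments, the same linkage matrix can be cut with `fcluster`. The sketch below rebuilds it and asks for three clusters, matching the number of iris species (which the algorithm itself never sees):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from scipy.cluster.hierarchy import linkage, fcluster

# Rebuild the same linkage matrix as in the example above
X_scaled = StandardScaler().fit_transform(load_iris().data)
linked = linkage(X_scaled, method='ward')

# Cut the tree into 3 flat clusters
labels = fcluster(linked, t=3, criterion='maxclust')
print("cluster sizes:", np.bincount(labels)[1:])
```

scikit-learn's `AgglomerativeClustering(n_clusters=3, linkage='ward')` produces an equivalent flat clustering directly, without an explicit dendrogram step.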
Conclusion
Hierarchical clustering is a powerful and intuitive unsupervised learning technique that builds a tree-like hierarchy of clusters without requiring you to define the number of clusters beforehand. By successively merging or dividing clusters based on distance metrics and linkage criteria, it uncovers natural groupings in the data and presents them in a visual format through a dendrogram.
Its flexibility, interpretability, and ability to reveal nested structures make it especially useful in exploratory data analysis, bioinformatics, customer segmentation, and anomaly detection. However, it does come with challenges—particularly with scalability and sensitivity to outliers.
When applied to real-world datasets like the Iris dataset, hierarchical clustering provides meaningful insights into underlying patterns. By standardizing data, applying an appropriate linkage method, and visualizing the results, one can make informed decisions about groupings that emerge naturally from the data.
Whether you’re working with small datasets or seeking rich visual insights through dendrograms, hierarchical clustering remains a go-to technique in the data scientist’s toolkit.
