Hierarchical Clustering Explained
Hierarchical clustering is an unsupervised machine learning algorithm used to group similar data points into clusters. Unlike K-Means, which requires the number of clusters to be defined beforehand, hierarchical clustering builds a tree-like structure of clusters, allowing you to choose the number of clusters after inspecting the results. This structure is known as a dendrogram.
Key Characteristics
Unsupervised: It does not require labeled data.
Tree-based: Builds a hierarchy of clusters.
Flexible: You don’t need to predefine the number of clusters.
Distance-based: Uses linkage methods and distance metrics to decide cluster similarity.
How Hierarchical Clustering Works
Hierarchical clustering works by either combining smaller clusters into larger ones (agglomerative) or dividing a large cluster into smaller ones (divisive). The most common method is agglomerative clustering.
Step 1: Start with Individual Clusters
Initially, treat each data point as its own cluster.
Step 2: Compute Distances Between Clusters
Calculate how close each cluster is to every other cluster. This is typically done using Euclidean distance and one of the following linkage methods:
- Single linkage (minimum distance)
- Complete linkage (maximum distance)
- Average linkage (average distance)
- Ward's method (minimizes total within-cluster variance)
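To see how the choice of linkage method changes the result, here is a minimal sketch using SciPy's `linkage` function on a small toy dataset (the point values are arbitrary, chosen only to form two well-separated groups):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Toy data: two tight groups of three 2-D points each (illustrative values)
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])

results = {}
for method in ["single", "complete", "average", "ward"]:
    Z = linkage(X, method=method)
    # Each row of Z records one merge: [cluster_a, cluster_b, distance, size];
    # the last row is the final merge joining the two groups.
    results[method] = Z[-1, 2]
    print(f"{method}: final merge distance = {Z[-1, 2]:.3f}")
```

Comparing the last merge distance across methods shows how differently each criterion measures the gap between the same two groups: single linkage reports the smallest value, complete linkage the largest, with average linkage in between.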
Step 3: Merge the Closest Clusters
Identify the two closest clusters based on the chosen linkage method and merge them.
Step 4: Repeat Until One Cluster Remains
Repeat the merging process until all data points are grouped into one big cluster.
Step 5: Choose the Number of Final Clusters
Use the dendrogram to visually select a cutoff height. Cutting the dendrogram at this height gives you the desired number of clusters.
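As a concrete sketch of this step, SciPy's `fcluster` can cut a linkage tree either into a fixed number of clusters or at a chosen distance threshold (the toy data and the threshold value below are illustrative assumptions):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Two well-separated toy groups (illustrative values)
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]])
Z = linkage(X, method="ward")

# Cut the tree to obtain exactly 2 clusters...
labels_k = fcluster(Z, t=2, criterion="maxclust")

# ...or cut at a chosen distance threshold instead
labels_h = fcluster(Z, t=5.0, criterion="distance")

print(labels_k)
print(labels_h)
```

With clearly separated groups, both cuts assign the first three points to one cluster and the last three to another.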
Visualizing the Results: The Dendrogram
A dendrogram is a tree-like diagram that shows how clusters were formed at each step. The vertical lines represent clusters being merged. The longer the vertical line, the greater the distance (or dissimilarity) between merged clusters. To decide how many clusters to keep, you cut the dendrogram at a chosen height.
Summary Table
| Step | Action | Purpose |
|---|---|---|
| 1 | Treat each point as a cluster | Initial step |
| 2 | Compute distances between clusters | Determine similarity |
| 3 | Merge closest clusters | Build cluster hierarchy |
| 4 | Repeat until one cluster | Complete the tree |
| 5 | Cut dendrogram | Choose the final number of clusters |
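The steps in the table above can be sketched as a naive, didactic implementation of agglomerative clustering with single linkage. This is O(n³) and for illustration only; real code should use SciPy or scikit-learn:

```python
import numpy as np

def naive_agglomerative(X):
    """Didactic sketch of steps 1-4: single-linkage agglomeration."""
    # Step 1: every point starts as its own cluster
    clusters = [[i] for i in range(len(X))]
    merges = []
    while len(clusters) > 1:
        # Step 2: cluster distance = minimum pairwise distance (single linkage)
        best = (None, None, np.inf)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(np.linalg.norm(X[i] - X[j])
                        for i in clusters[a] for j in clusters[b])
                if d < best[2]:
                    best = (a, b, d)
        a, b, d = best
        # Step 3: merge the two closest clusters
        merges.append((clusters[a], clusters[b], d))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
        # Step 4: the loop repeats until one cluster remains
    return merges

X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
merges = naive_agglomerative(X)
for left, right, dist in merges:
    print(left, "+", right, "at distance", round(dist, 3))
```

The returned merge list is essentially what SciPy's linkage matrix encodes, and it is exactly the information a dendrogram draws.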
Mathematical Foundation of Hierarchical Clustering
Hierarchical clustering relies on the computation of distances between data points and the use of a linkage criterion to merge clusters. The algorithm uses a distance matrix and linkage functions to determine which clusters to merge at each step.
Distance Matrix
At the core of hierarchical clustering is a distance matrix \( D \), where each element \( D(i, j) \) represents the distance between data point \( i \) and data point \( j \). For Euclidean distance, it is computed as:

\[
D(i, j) = \sqrt{\sum_{k=1}^{n} (x_{ik} - x_{jk})^2}
\]

Here, \( x_{ik} \) and \( x_{jk} \) are the \( k \)-th features of data points \( i \) and \( j \), respectively.
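As a small illustration, the full distance matrix can be computed with SciPy's `pdist` and `squareform` (the three points below are chosen so the distances come out to round numbers):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])

# Full n x n Euclidean distance matrix D, with D[i, j] = ||x_i - x_j||
D = squareform(pdist(X, metric="euclidean"))
print(D)
# D[0, 1] is sqrt((3 - 0)^2 + (4 - 0)^2) = 5.0
```

In practice `linkage` accepts the raw observation matrix directly and computes these distances internally, so you rarely need to build `D` yourself.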
Linkage Criteria
Once distances are computed, hierarchical clustering uses a linkage criterion to determine the distance between clusters. Common linkage methods include:
- Single Linkage: minimum distance between any two points in different clusters
\[
D(C_i, C_j) = \min_{x \in C_i,\, y \in C_j} \|x - y\|
\]
- Complete Linkage: maximum distance between any two points in different clusters
\[
D(C_i, C_j) = \max_{x \in C_i,\, y \in C_j} \|x - y\|
\]
- Average Linkage: average of all pairwise distances between points in different clusters
\[
D(C_i, C_j) = \frac{1}{|C_i||C_j|} \sum_{x \in C_i} \sum_{y \in C_j} \|x - y\|
\]
- Ward's Method: the increase in total within-cluster variance caused by merging the two clusters; the pair with the smallest increase is merged:
\[
D(C_i, C_j) = \frac{|C_i||C_j|}{|C_i| + |C_j|} \|\mu_i - \mu_j\|^2
\]
where \( \mu_i \) and \( \mu_j \) are the centroids of clusters \( C_i \) and \( C_j \).
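The four formulas above can be checked numerically on two tiny clusters (the point coordinates are arbitrary illustrative values):

```python
import numpy as np

# Two small clusters of two 2-D points each (illustrative values)
Ci = np.array([[0.0, 0.0], [0.0, 1.0]])
Cj = np.array([[4.0, 0.0], [4.0, 1.0]])

# All |Ci| x |Cj| cross-cluster pairwise distances
pairwise = np.array([[np.linalg.norm(x - y) for y in Cj] for x in Ci])

single   = pairwise.min()    # min over all cross-cluster pairs
complete = pairwise.max()    # max over all cross-cluster pairs
average  = pairwise.mean()   # mean of all |Ci|*|Cj| pairs

# Ward: weighted squared distance between the cluster centroids
mu_i, mu_j = Ci.mean(axis=0), Cj.mean(axis=0)
ni, nj = len(Ci), len(Cj)
ward = ni * nj / (ni + nj) * np.sum((mu_i - mu_j) ** 2)

print(single, complete, average, ward)
```

Single linkage always yields the smallest inter-cluster distance and complete linkage the largest, with average linkage in between, which matches the min/max/mean structure of the formulas.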
Major Advantages of Hierarchical Clustering
- No need to specify the number of clusters in advance
- Produces a dendrogram for visual and detailed cluster analysis
- Effective for small to medium-sized datasets
- Supports multiple linkage methods (single, complete, average, Ward's) to fit different data structures
Major Challenges of Hierarchical Clustering
- Not scalable to large datasets due to high time and space complexity
- Sensitive to outliers and noise, which can distort the clustering structure
- Final results depend on the chosen linkage method
- Irreversible merges: once two clusters are merged, they cannot be split again
Common Use Cases
- Gene expression analysis – Identifying gene groups with similar activity patterns
- Customer segmentation – Grouping users by demographics or purchasing behavior
- Document clustering – Organizing large sets of text into related topics or themes
- Social network analysis – Detecting communities or clusters of users in networks
- Anomaly detection – Finding rare or unexpected groupings in datasets
The following Python example applies agglomerative hierarchical clustering with Ward's method to the Iris dataset and visualizes the result as a dendrogram.
```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from scipy.cluster.hierarchy import dendrogram, linkage

# Load real-world data: Iris dataset
iris = load_iris()
X = iris.data
feature_names = iris.feature_names

# Standardize features (important for distance-based algorithms)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Perform hierarchical clustering using Ward's method
linked = linkage(X_scaled, method='ward')

# Mobile-friendly dendrogram visualization
plt.figure(figsize=(6, 4))  # smaller figure for mobile view
dendrogram(
    linked,
    truncate_mode='lastp',  # show only the last p merged clusters
    p=30,
    leaf_rotation=90.,
    leaf_font_size=8.,
)
plt.title("Hierarchical Clustering Dendrogram (Iris)", fontsize=10)
plt.xlabel("Cluster Index", fontsize=9)
plt.ylabel("Distance", fontsize=9)
plt.xticks(fontsize=8)
plt.yticks(fontsize=8)
plt.tight_layout()
plt.grid(True)
plt.show()
```
Explanation of the Code
- Dataset Selection: The Iris dataset is loaded from scikit-learn. It's a popular real-world dataset with 150 samples of iris flowers described by four features. Since hierarchical clustering is unsupervised, we use only the feature values and ignore the class labels.
- Feature Scaling: Before clustering, the feature values are standardized. This ensures that all features contribute equally to the distance calculations, which is critical for distance-based algorithms like hierarchical clustering.
- Distance Calculation and Linkage Matrix: Hierarchical clustering starts by calculating the distances between all pairs of points. Ward's linkage then aims to minimize the within-cluster variance at each merge. This step produces a linkage matrix, which records which clusters are merged at each step and at what distance.
- Building the Hierarchy: Initially, every data point is its own cluster. The algorithm finds the two closest clusters (based on Ward's linkage) and merges them. This process repeats, building the hierarchy bottom-up until all points belong to one cluster.
- Dendrogram Visualization: A dendrogram visualizes how clusters are formed at each stage. Each branch of the tree represents a merge. By analyzing the height of these branches, one can decide where to "cut" the tree to form the desired number of clusters.
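To turn the dendrogram into concrete cluster assignments, the same linkage matrix can be cut with `fcluster`. The sketch below rebuilds it and asks for three clusters, matching the number of iris species (which the algorithm itself never sees):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from scipy.cluster.hierarchy import linkage, fcluster

# Rebuild the same linkage matrix as in the example above
X_scaled = StandardScaler().fit_transform(load_iris().data)
linked = linkage(X_scaled, method='ward')

# Cut the tree into 3 flat clusters
labels = fcluster(linked, t=3, criterion='maxclust')
print("cluster sizes:", np.bincount(labels)[1:])
```

scikit-learn's `AgglomerativeClustering(n_clusters=3, linkage='ward')` produces an equivalent flat clustering directly, without an explicit dendrogram step.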
Conclusion
Hierarchical clustering is a powerful and intuitive unsupervised learning technique that builds a tree-like hierarchy of clusters without requiring you to define the number of clusters beforehand. By successively merging or dividing clusters based on distance metrics and linkage criteria, it uncovers natural groupings in the data and presents them in a visual format through a dendrogram.
Its flexibility, interpretability, and ability to reveal nested structures make it especially useful in exploratory data analysis, bioinformatics, customer segmentation, and anomaly detection. However, it does come with challenges—particularly with scalability and sensitivity to outliers.
When applied to real-world datasets like the Iris dataset, hierarchical clustering provides meaningful insights into underlying patterns. By standardizing data, applying an appropriate linkage method, and visualizing the results, one can make informed decisions about groupings that emerge naturally from the data.
Whether you’re working with small datasets or seeking rich visual insights through dendrograms, hierarchical clustering remains a go-to technique in the data scientist’s toolkit.
