K-Means Clustering Explained Simply

K-Means clustering is an unsupervised machine learning algorithm used to group similar data points into clusters. The goal of K-Means is to organize data into groups where items in the same group (or cluster) are more similar to each other than to items in other groups. It’s called “K-Means” because you choose a number of clusters (denoted as “K”), and the algorithm tries to find the best grouping.

Key Characteristics:

  • Unsupervised: It does not require labeled data.
  • Centroid-based: Each cluster is represented by the mean (centroid) of the data points in that cluster.
  • Iterative: The algorithm repeats assignment and update steps until convergence.
  • Distance-based: Most commonly uses Euclidean distance to measure similarity.

How K-Means Clustering Works

It works by iteratively improving group assignments and cluster centers.

Step 1: Choose the Number of Clusters (K)

Decide how many groups (clusters) you want the data to be divided into.

  • Example: K = 3 → Group the data into 3 clusters.

Step 2: Initialize K Centroids

Randomly select K initial centroids (the starting points for each cluster’s center). These can be:

  • Random data points from your dataset.
  • Random coordinates in the data space.

Step 3: Assign Data Points to the Nearest Centroid

For each data point:

  • Calculate its distance to all centroids (usually using Euclidean distance).
  • Assign it to the nearest centroid.

Now, the dataset is grouped into K clusters based on proximity.

Step 4: Update the Centroids

For each cluster:

  • Compute the mean of all data points assigned to that cluster.
  • Set this mean as the new centroid.

This updates the cluster centers to better fit the data.

Step 5: Repeat Steps 3 and 4

Repeat the process of:

  • Assigning data points to the nearest centroid
  • Recalculating the new centroids

Keep doing this until:

  • The centroids don’t move much anymore, or
  • The data point assignments stop changing

This is called convergence.
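
The loop described in Steps 2 through 5 can be sketched in a few lines of NumPy (a minimal illustration on synthetic data; it uses data-point initialization and does not handle the edge case of a cluster becoming empty):

```python
import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: use k random data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3: assign each point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: move each centroid to the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: stop once the centroids barely move (convergence)
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break
    return labels, centroids

# Demo on two well-separated synthetic blobs
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.5, size=(50, 2)),
               rng.normal(5, 0.5, size=(50, 2))])
labels, centroids = kmeans(X, k=2)
```

On data like this the loop converges in a handful of iterations; scikit-learn's KMeans implements the same idea with better seeding (k-means++), multiple restarts (n_init), and empty-cluster handling.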

Step 6: Final Clusters

Once the algorithm converges:

  • The positions of the centroids are fixed.
  • Each data point is permanently assigned to one of the K final clusters.

These clusters are your result — data grouped by similarity.

 

Summary Table

Step  Action                              Purpose
1     Choose K                            Decide how many clusters to make
2     Initialize K centroids              Starting guess for cluster centers
3     Assign points to nearest centroid   Form temporary clusters
4     Recalculate centroids               Adjust clusters to fit data better
5     Repeat until stable                 Converge to best cluster assignments
6     Output final clusters               Done! Use the results

Mathematical Foundation of K-Means Clustering

Objective Function

K-Means clustering aims to partition data into \( K \) clusters such that the sum of squared distances from each data point to the centroid of its assigned cluster is minimized. The objective function is:

\[
\text{Minimize} \quad J = \sum_{i=1}^{K} \sum_{x \in C_i} \| x - \mu_i \|^2
\]

Where:

  • \( C_i \) is the set of points in cluster \( i \)
  • \( \mu_i \) is the centroid (mean) of cluster \( i \)

Euclidean Distance

The Euclidean distance between a point \( x \) and a centroid \( \mu \) is given by:

\[
\| x - \mu \| = \sqrt{ \sum_{j=1}^{n} (x_j - \mu_j)^2 }
\]

To simplify computation, K-Means minimizes the squared form of this distance:

\[
\| x - \mu \|^2 = \sum_{j=1}^{n} (x_j - \mu_j)^2
\]
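
In code, the squared distance is a one-liner; a quick NumPy check with made-up values:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])   # a data point
mu = np.array([2.0, 0.0, 3.0])  # a centroid

# Squared Euclidean distance: sum of squared per-feature differences
sq_dist = np.sum((x - mu) ** 2)           # (1-2)^2 + (2-0)^2 + (3-3)^2 = 5
# Equivalent via the norm, then squared
sq_dist_norm = np.linalg.norm(x - mu) ** 2

print(sq_dist)  # 5.0
```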

Algorithm Steps

  1. Initialization: Select \( K \) centroids randomly.
  2. Assignment: Assign each point to the nearest centroid.
  3. Update: Recalculate centroids as the mean of the points in each cluster.
  4. Repeat: Iterate steps 2 and 3 until centroids no longer change.

Optimization Method

K-Means uses an alternating optimization (a form of coordinate descent) to minimize the objective function, and it converges to a local minimum rather than the global one. To improve initialization and avoid poor local minima, the k-means++ seeding algorithm is commonly used.
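
The k-means++ seeding idea can be sketched as follows: the first centroid is a uniformly random data point, and each subsequent centroid is sampled with probability proportional to its squared distance from the nearest centroid chosen so far (a minimal NumPy sketch; scikit-learn uses a refined variant of this via init='k-means++'):

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # First centroid: a uniformly random data point
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance from each point to its nearest chosen centroid
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centroids], axis=0)
        # Sample the next centroid with probability proportional to d^2,
        # so far-away regions are likely to receive a centroid
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, size=(30, 2)),
               rng.normal(4, 0.3, size=(30, 2))])
init = kmeans_pp_init(X, k=2)
```

Because the sampling favors points far from existing centroids, the initial centroids tend to land in different clusters, which makes poor local minima much less likely.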

Computational Complexity

  • Each iteration: \( O(nKd) \), for \( n \) points, \( K \) clusters, and \( d \) features
  • Number of iterations: typically small (e.g., < 100)
  • Highly scalable to large datasets

Summary Table

Concept          Description
Objective        Minimize within-cluster variance
Distance         Squared Euclidean distance \( \| x - \mu \|^2 \)
Update           Recalculate centroids as cluster means
Convergence      Local minimum (not guaranteed global)
Initialization   Sensitive; k-means++ improves reliability
Major Advantages of K-Means Clustering

Simplicity and Efficiency
K-Means is easy to implement and understand. It converges quickly on small to medium-sized datasets.

Scalability
It performs well on large datasets, with time complexity that scales linearly with the number of data points, clusters, and iterations.

Unsupervised Learning
K-Means doesn’t require labeled data, making it useful for exploratory data analysis.

Guaranteed Convergence
The algorithm converges in a finite number of iterations, because each assignment and update step reduces (or leaves unchanged) the within-cluster variance.

Interpretability
Clusters and their centroids are straightforward to visualize and interpret, especially in 2D or 3D space.

Major Challenges of K-Means Clustering

Fixed Number of Clusters (K)
The number of clusters must be predefined, which may not be known in advance.

Sensitive to Initialization
The final output depends on the initial placement of centroids, which can lead to different results.

Assumes Spherical Clusters
K-Means works best when clusters are spherical and equally sized. It struggles with elongated or unevenly sized clusters.

Sensitive to Outliers
Outliers can skew the centroids, leading to poor clustering performance.

Not Suitable for Non-Convex Clusters
K-Means cannot accurately capture complex cluster shapes such as concentric circles or other non-linear structures.

Hard Assignment
Each data point is assigned to exactly one cluster, which can be limiting for data points that lie on the boundaries.

Common Use Cases of K-Means Clustering

Customer Segmentation
Used in marketing to group customers by behavior, preferences, or demographics.

Market Basket Analysis
Identifies groups of products that are frequently bought together by segmenting transaction data.

Image Compression
Reduces the number of colors in an image by clustering similar pixel values.
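
As a sketch of the image-compression idea, the pixel colors of a small synthetic image can be reduced to a palette of 8 representative colors (assumes scikit-learn is installed; a real use would load an actual image instead of random pixels):

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic "image": 32x32 pixels with RGB values in [0, 1]
rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))
pixels = image.reshape(-1, 3)  # one row per pixel

# Cluster pixel colors into 8 groups; each centroid is a representative color
kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(pixels)

# Replace every pixel with the centroid color of its cluster
compressed = kmeans.cluster_centers_[kmeans.labels_].reshape(image.shape)

# The compressed image now uses at most 8 distinct colors
n_colors = len(np.unique(compressed.reshape(-1, 3), axis=0))
```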

Document Clustering
Groups articles or documents based on textual similarity after converting text to numerical features.

Anomaly Detection
Detects unusual data points that do not fit into any cluster well.
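
A simple recipe for this use case: fit K-Means, then flag points that are unusually far from their nearest centroid (a sketch on synthetic data; the mean-plus-three-standard-deviations threshold is an arbitrary choice for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Normal data: two tight blobs, plus one obvious outlier far from both
X = np.vstack([rng.normal(0, 0.3, size=(50, 2)),
               rng.normal(5, 0.3, size=(50, 2)),
               [[20.0, 20.0]]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
# transform() gives each point's distance to every centroid;
# the minimum is the distance to the point's own cluster center
dists = np.min(kmeans.transform(X), axis=1)

# Flag points beyond mean + 3 standard deviations (a simple heuristic)
threshold = dists.mean() + 3 * dists.std()
outliers = np.where(dists > threshold)[0]
```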

Computer Vision
Used in image segmentation and object detection tasks.

Social Network Analysis
Identifies communities or clusters of users with similar interaction patterns.


Python Example: K-Means Clustering on Iris Dataset

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Load the real dataset
iris = load_iris()
X = iris.data

# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply KMeans clustering
kmeans = KMeans(n_clusters=3, init='k-means++', n_init=10, max_iter=300, random_state=42)
y_kmeans = kmeans.fit_predict(X_scaled)

# Mobile-friendly plot configuration
plt.figure(figsize=(5, 4))  # smaller figure size for mobile
plt.scatter(X_scaled[y_kmeans == 0, 0], X_scaled[y_kmeans == 0, 1], s=30, label='Cluster 1')
plt.scatter(X_scaled[y_kmeans == 1, 0], X_scaled[y_kmeans == 1, 1], s=30, label='Cluster 2')
plt.scatter(X_scaled[y_kmeans == 2, 0], X_scaled[y_kmeans == 2, 1], s=30, label='Cluster 3')

# Plot centroids
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            s=80, c='black', marker='X', label='Centroids')

# Customize labels for readability on small screens
plt.title("K-Means Clustering (Iris)", fontsize=10)
plt.xlabel("Feature 1 (scaled)", fontsize=9)
plt.ylabel("Feature 2 (scaled)", fontsize=9)
plt.xticks(fontsize=8)
plt.yticks(fontsize=8)
plt.legend(fontsize=8)
plt.tight_layout()  # ensures nothing is clipped
plt.grid(True)
plt.show()
```

How K-Means Clustering Has Been Applied to the Iris Dataset

The Iris dataset, a classic dataset in machine learning, contains 150 samples of iris flowers, each described by four features: sepal length, sepal width, petal length, and petal width. The aim of applying K-Means clustering to this dataset is to discover natural groupings (clusters) of flowers based on these features—without using the actual species labels (i.e., without supervision).

Step-by-Step Application

1. Data Loading
The first step is to load the Iris dataset. This dataset comes with scikit-learn and provides a convenient way to access both the feature values and species names, although the clustering process does not use the labels.

2. Data Preprocessing
Before clustering, the features are standardized using a technique called feature scaling. Since K-Means uses Euclidean distance to measure how close a data point is to a cluster center, it’s important that all features contribute equally to the result. Standardization ensures that each feature has the same scale (mean 0, standard deviation 1), which helps K-Means perform more accurately.

3. Choosing the Number of Clusters (K)
K is set to 3 because we expect to find three natural clusters in the data—corresponding (though not explicitly) to the three iris species: Setosa, Versicolor, and Virginica.

4. Applying the K-Means Algorithm
The K-Means algorithm is then applied to the standardized dataset. Initially, it randomly chooses three centroids (cluster centers) and begins iteratively updating them. It does this through two key steps:

  • Assignment Step: Each data point is assigned to the nearest centroid, forming temporary clusters.

  • Update Step: For each cluster, the centroid is recalculated as the mean of all points assigned to it.

These steps repeat until the centroids stop changing significantly—meaning the algorithm has converged.

5. Cluster Prediction
Once the model converges, each data point has been assigned to one of the three clusters. These assignments reflect groupings of flowers that are similar in their measurements.
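
Although the species labels are never used during clustering, they can serve as a sanity check afterwards; scikit-learn's adjusted Rand index is one common way to score the agreement (a sketch; the exact score depends on preprocessing and the random seed):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_scaled = StandardScaler().fit_transform(iris.data)

# Cluster without labels, as in the example above
labels = KMeans(n_clusters=3, init='k-means++', n_init=10,
                random_state=42).fit_predict(X_scaled)

# ARI compares cluster assignments with the true species:
# 1.0 is perfect agreement, ~0.0 is what random labeling would score
ari = adjusted_rand_score(iris.target, labels)
print(round(ari, 3))
```

A score well above zero indicates that the unsupervised clusters substantially recover the species structure, even though two of the species overlap in feature space.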

6. Visualization
To make the results interpretable, the first two features (typically sepal length and sepal width) are plotted in a two-dimensional scatter plot. Each point is colored based on the cluster it belongs to, and the centroids are marked for reference. Although clustering is performed in four dimensions, this 2D visualization helps us understand the overall structure.