t-SNE Algorithm Explained with Python

The t-SNE algorithm (short for t-distributed Stochastic Neighbor Embedding) is a powerful and widely used unsupervised machine learning technique for visualizing high-dimensional data in 2D or 3D space. Unlike traditional clustering or classification algorithms, t-SNE is designed purely for visualization. It captures complex, non-linear relationships in the data and reveals hidden patterns by projecting high-dimensional points into a space that is easy to interpret.

t-SNE is especially effective when working with datasets like word embeddings, image features, or deep learning representations, where understanding the structure visually can guide analysis or decision-making. By preserving the local similarity between data points, the t-SNE algorithm creates maps where similar points stay close together, often forming clear, interpretable clusters.

In this guide, we’ll explain how the t-SNE algorithm works, explore its mathematical foundation, highlight its advantages and limitations, and walk through a practical implementation using Python and scikit-learn.

What Is the t-SNE Algorithm?

t-SNE is a non-linear dimensionality reduction technique specifically designed for visualizing data. It converts similarities between data points to joint probabilities and tries to minimize the Kullback-Leibler divergence between the joint probabilities of the low-dimensional embedding and the high-dimensional data.

Originally introduced by Laurens van der Maaten and Geoffrey Hinton, t-SNE is widely used for exploring patterns in datasets like images, word embeddings, gene expression data, and more.

Key Features of t-SNE Algorithm

  • Dimensionality Reduction: Reduces data from high-dimensional space to 2D or 3D for visualization

  • Preserves Local Structure: Keeps similar data points close to each other

  • Probabilistic Mapping: Models pairwise similarity in terms of conditional probabilities

  • Non-Linear: Captures complex relationships better than linear methods like PCA

  • Visualization-Focused: Not designed for general-purpose feature extraction or downstream ML tasks

How t-SNE Algorithm Works

The core idea of t-SNE is to transform the pairwise similarities between high-dimensional data points into probabilities and then find a lower-dimensional representation that preserves those similarities as closely as possible.

Step 1: Compute pairwise similarities in the high-dimensional space using Gaussian distributions
Step 2: Map these similarities into the low-dimensional space using t-distributions, which have heavier tails
Step 3: Minimize the Kullback-Leibler divergence (KL divergence) between the high-dimensional and low-dimensional distributions using gradient descent
Step 4: Repeat until the embedded coordinates stabilize and capture the local structure of the data

The Math Behind t-SNE Algorithm: Step-by-Step Explanation of Key Equations

1. High-Dimensional Similarity \( p_{ij} \)
Step 1.1: Conditional Probability

For data points \( x_i, x_j \in \mathbb{R}^D \):

\[
p_{j|i} = \frac{\exp\left( -\frac{\| x_i - x_j \|^2}{2\sigma_i^2} \right)}{\sum_{k \ne i} \exp\left( -\frac{\| x_i - x_k \|^2}{2\sigma_i^2} \right)}
\]

  • Numerator: Assigns higher values to points \( x_j \) that are closer to \( x_i \).
  • Denominator: Normalizes over all other points \( k \ne i \).
  • \( \sigma_i \): Bandwidth selected via binary search to match the desired
    perplexity
    \( \mathrm{Perp}(P_i) = 2^{H(P_i)} \), where the entropy is
    \( H(P_i) = -\sum_j p_{j|i} \log_2 p_{j|i} \).
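To make the perplexity calibration concrete, here is a minimal NumPy sketch of the binary search for \( \sigma_i \). This is an illustrative toy, not scikit-learn's actual implementation, and the helper names are our own:

```python
import numpy as np

def conditional_probs(dists_i, sigma):
    """p_{j|i} for one point, given squared distances to all other points."""
    p = np.exp(-dists_i / (2 * sigma ** 2))
    return p / p.sum()

def sigma_for_perplexity(dists_i, target_perp, tol=1e-5, max_iter=50):
    """Binary-search the bandwidth sigma_i so that 2^H(P_i) == target_perp."""
    lo, hi = 1e-10, 1e10
    sigma = (lo + hi) / 2
    for _ in range(max_iter):
        sigma = (lo + hi) / 2
        p = conditional_probs(dists_i, sigma)
        entropy = -np.sum(p * np.log2(p + 1e-12))
        perp = 2 ** entropy
        if abs(perp - target_perp) < tol:
            break
        if perp > target_perp:   # distribution too flat -> shrink sigma
            hi = sigma
        else:                    # distribution too peaked -> grow sigma
            lo = sigma
    return sigma

# Toy example: squared distances from x_0 to nine other random points
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 5))
d0 = np.sum((X[0] - X[1:]) ** 2, axis=1)
sigma0 = sigma_for_perplexity(d0, target_perp=5.0)
```

Real implementations run this search once per data point, so every \( x_i \) gets its own bandwidth adapted to its local density.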
Step 1.2: Symmetrized Joint Probability

\[
p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}
\]

  • Ensures that \( p_{ij} = p_{ji} \) (symmetric).
  • Scaling by \( 2n \) gives probabilistic normalization:

\[
\sum_{i \ne j} p_{ij} = 1
\]

Intuition: Captures the average mutual neighborhood affinity between \( x_i \) and \( x_j \).
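A quick sketch of this symmetrization step (the helper name is hypothetical), starting from a matrix whose row \( i \) holds the conditional probabilities \( p_{j|i} \):

```python
import numpy as np

def symmetrize(P_cond):
    """Turn conditional probabilities p_{j|i} (each row sums to 1)
    into symmetric joint probabilities p_{ij} that sum to 1 overall."""
    n = P_cond.shape[0]
    return (P_cond + P_cond.T) / (2 * n)

# Toy conditional matrix: random nonnegative rows, zero diagonal, rows normalized
rng = np.random.default_rng(1)
C = rng.random((4, 4))
np.fill_diagonal(C, 0.0)
C /= C.sum(axis=1, keepdims=True)

P = symmetrize(C)
```

Since each of the \( n \) rows of the conditional matrix sums to 1, dividing \( p_{j|i} + p_{i|j} \) by \( 2n \) makes the whole joint matrix sum to exactly 1.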

2. Low-Dimensional Similarity \( q_{ij} \)

In the embedding space \( y_i, y_j \in \mathbb{R}^d \), typically \( d = 2 \) or \( 3 \):

\[
q_{ij} = \frac{(1 + \| y_i - y_j \|^2)^{-1}}{\sum_{k \ne l} (1 + \| y_k - y_l \|^2)^{-1}}
\]

  • The numerator is a Cauchy-like similarity (a Student's t-distribution with one degree of freedom).
  • The denominator ensures:

\[
\sum_{i \ne j} q_{ij} = 1
\]

Why a t-distribution? Its heavy tails push dissimilar points further apart,
overcoming the crowding problem (where many high-dimensional points collapse into a small region in low dimensions).
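The whole \( q_{ij} \) matrix can be computed in a few vectorized lines. Again, this is a didactic sketch rather than any library's implementation:

```python
import numpy as np

def q_matrix(Y):
    """Student-t (1 d.o.f.) joint probabilities q_{ij} for an embedding Y."""
    # Pairwise squared Euclidean distances, shape (n, n)
    sq = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    num = 1.0 / (1.0 + sq)
    np.fill_diagonal(num, 0.0)   # self-similarity q_{ii} is defined as 0
    return num / num.sum()

# Three embedded points: the first two are close, the third is far away
Y = np.array([[0.0, 0.0],
              [1.0, 0.0],
              [0.0, 5.0]])
Q = q_matrix(Y)
```

As expected, the close pair gets more probability mass than the distant pair, and the heavy tails mean the distant pair's similarity decays polynomially rather than exponentially.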

3. Cost Function (KL Divergence)

\[
\mathcal{C} = \mathrm{KL}(P \,\|\, Q) = \sum_{i \ne j} p_{ij} \log\left( \frac{p_{ij}}{q_{ij}} \right)
\]

  • Penalizes large discrepancies between the two distributions.
  • Asymmetric: the KL divergence focuses on preserving high \( p_{ij} \) values, i.e., local neighborhoods.

If \( p_{ij} \) is high (i.e., \( x_i \) and \( x_j \) are close in the original space) but \( q_{ij} \) is small, this adds a large penalty.
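A minimal sketch of this cost, assuming \( P \) and \( Q \) are already-normalized symmetric matrices with zero diagonals (the small `eps` guards against `log(0)`):

```python
import numpy as np

def kl_divergence(P, Q, eps=1e-12):
    """KL(P || Q) summed over all i != j."""
    mask = ~np.eye(P.shape[0], dtype=bool)   # exclude the diagonal
    p, q = P[mask], Q[mask]
    return np.sum(p * np.log((p + eps) / (q + eps)))

# Two toy joint distributions (symmetric, zero diagonal, summing to 1)
rng = np.random.default_rng(0)
A = rng.random((5, 5)); A = (A + A.T) / 2
np.fill_diagonal(A, 0.0)
P = A / A.sum()
B = rng.random((5, 5)); B = (B + B.T) / 2
np.fill_diagonal(B, 0.0)
Q = B / B.sum()
```

By Gibbs' inequality the divergence is zero when the two distributions match and strictly positive otherwise, which is what gradient descent exploits.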

4. Gradient Descent Update Rule

To minimize \( \mathcal{C} \), t-SNE performs gradient descent:

\[
\frac{\partial \mathcal{C}}{\partial y_i} = 4 \sum_{j \ne i} (p_{ij} - q_{ij})(y_i - y_j)(1 + \| y_i - y_j \|^2)^{-1}
\]

Term-by-Term Interpretation:
  • \( (p_{ij} - q_{ij}) \): Attraction if positive, repulsion if negative.
  • \( (y_i - y_j) \): Direction vector from \( y_j \) to \( y_i \).
  • \( (1 + \| y_i - y_j \|^2)^{-1} \): The Student-t kernel term, arising from differentiating \( q_{ij} \), which damps forces between distant points.

This creates attractive forces between similar points and repulsive forces between dissimilar ones.
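The update rule above can be sketched as a bare-bones optimization loop. This toy version uses plain gradient descent on a random stand-in for \( P \); real implementations add momentum, early exaggeration, and the Barnes-Hut approximation, all of which are omitted here:

```python
import numpy as np

def q_and_kernel(Y):
    """Joint probabilities Q and the unnormalized t-kernel (1+||y_i-y_j||^2)^-1."""
    sq = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    inv = 1.0 / (1.0 + sq)
    np.fill_diagonal(inv, 0.0)
    return inv / inv.sum(), inv

def tsne_gradient(P, Y):
    """dC/dY: 4 * sum_j (p_ij - q_ij)(y_i - y_j)(1+||y_i-y_j||^2)^-1."""
    Q, inv = q_and_kernel(Y)
    W = (P - Q) * inv
    # sum_j W_ij (y_i - y_j) == diag(row sums of W) @ Y - W @ Y
    return 4.0 * (np.diag(W.sum(axis=1)) - W) @ Y

def kl_cost(P, Y, eps=1e-12):
    Q, _ = q_and_kernel(Y)
    mask = ~np.eye(len(P), dtype=bool)
    return np.sum(P[mask] * np.log((P[mask] + eps) / (Q[mask] + eps)))

rng = np.random.default_rng(0)
n = 10
A = rng.random((n, n)); A = (A + A.T) / 2
np.fill_diagonal(A, 0.0)
P = A / A.sum()                            # stand-in for the real p_ij
Y = rng.normal(scale=1e-2, size=(n, 2))    # small random initialization

before = kl_cost(P, Y)
for _ in range(300):
    Y -= 1.0 * tsne_gradient(P, Y)         # plain gradient step, lr = 1.0
after = kl_cost(P, Y)
```

Even this stripped-down loop drives the KL cost down, which is the whole of t-SNE's optimization: repeatedly nudging each \( y_i \) along the attraction/repulsion forces until the embedding stabilizes.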


[Figure: t-SNE visual schematic]

Advantages of t-SNE Algorithm

  • Excellent for visualizing complex, high-dimensional data

  • Captures non-linear relationships and local structure

  • No assumption about the shape of the data

  • Widely used in deep learning, NLP, genomics, and image analysis

  • Can reveal clusters, subgroups, or outliers even in raw, unstructured data

Limitations of t-SNE Algorithm

  • Not suitable for clustering or classification tasks

  • Computationally expensive, especially with large datasets

  • Results are not deterministic unless random_state is fixed

  • Difficult to interpret exact distances and axes in the embedded space

  • Not ideal for preserving global structure — only local relationships are retained

Common Applications of t-SNE Algorithm

  • Visualizing word embeddings like Word2Vec or BERT

  • Exploring hidden layers in deep learning models

  • Analyzing image datasets such as MNIST or CIFAR

  • Understanding gene expression patterns in biology

  • Reducing dimensions in high-dimensional feature sets for inspection

t-SNE in Python with Real Dataset (Digits Dataset)

Let’s apply the t-SNE algorithm to the Digits dataset from scikit-learn. This dataset contains 1,797 8×8 images of handwritten digits. t-SNE will help us visualize how these digits are distributed in 2D space.

				
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# Load the digits dataset
digits = load_digits()
X = digits.data
y = digits.target

# Standardize the features
X_scaled = StandardScaler().fit_transform(X)

# Apply t-SNE to project into 2D
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_tsne = tsne.fit_transform(X_scaled)

# Plot the t-SNE result
plt.figure(figsize=(6, 4))
scatter = plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap='tab10', s=15)
plt.title("t-SNE Visualization of Digits Dataset", fontsize=10)
plt.xlabel("t-SNE Feature 1", fontsize=9)
plt.ylabel("t-SNE Feature 2", fontsize=9)
plt.xticks(fontsize=8)
plt.yticks(fontsize=8)
plt.grid(True)
plt.colorbar(scatter, label='Digit Label')
plt.tight_layout()
plt.show()

Interpreting the Output

  • Each point represents a digit image embedded into 2D space.
  • Colors indicate the true digit label (0–9), revealing natural clusters.
  • Points from the same digit class tend to group together, showing that the local high-dimensional structure was preserved.
  • Outliers or ambiguous points appear at the boundaries between clusters.
  • Although t-SNE doesn’t produce labels, it helps in visually exploring separability and structure.

Conclusion

The t-SNE algorithm is a powerful tool for visualizing and exploring high-dimensional data in two- or three-dimensional spaces. Widely embraced in data science and machine learning, it excels at uncovering latent structures, clusters, and relationships that are often obscured in raw feature spaces. While it is not suited to predictive modeling or direct feature engineering, its strength lies in intuitive data interpretation and insightful visual storytelling.

Use the t-SNE algorithm when you need to explore embeddings, assess clustering quality, or present complex data in a way that’s easy to understand and communicate to both technical and non-technical audiences.

References and Further Reading

Wikipedia: t-distributed stochastic neighbor embedding

Scikit-Learn: TSNE

 

Top FAQs About the t-SNE Algorithm

  1. What is the t-SNE algorithm used for?
     The t-SNE algorithm is primarily used for visualizing high-dimensional data in 2D or 3D space, helping reveal clusters, groupings, and patterns that may not be visible through other techniques.

  2. Is the t-SNE algorithm suitable for large datasets?
     While t-SNE can handle moderate-sized datasets well, it may become computationally expensive with very large datasets. Techniques like Barnes-Hut t-SNE or UMAP are sometimes preferred for larger data.

  3. Can I use the t-SNE algorithm for classification tasks?
     No, the t-SNE algorithm is not designed for classification or predictive modeling. It’s a non-linear dimensionality reduction method used purely for data visualization and exploration.

  4. How does the t-SNE algorithm differ from PCA?
     Unlike PCA, which is linear and focuses on preserving global variance, the t-SNE algorithm is non-linear and prioritizes preserving local structure and similarities in the data.

  5. What are the key parameters of the t-SNE algorithm?
     Important parameters include perplexity, learning rate, and number of iterations. These control the balance between local and global data structure and the convergence speed.

  6. Is the output of the t-SNE algorithm deterministic?
     No, the output can vary between runs unless you fix the random seed. This stochastic nature means different visualizations can be generated from the same input data.

  7. When should I avoid using the t-SNE algorithm?
     Avoid t-SNE when you need reproducibility, feature transformation, or downstream modeling. It is also less effective with noisy or sparse data.

  8. Does the t-SNE algorithm work well with embeddings from neural networks?
     Yes, t-SNE is often used to visualize embeddings from models like Word2Vec, BERT, or CNNs to assess how well the model captures relationships between data points.

  9. Can I interpret distances in t-SNE plots quantitatively?
     Not exactly. While nearby points indicate similarity, distances and global structures should not be interpreted quantitatively, as the scale is arbitrary and non-linear.

  10. What are some alternatives to the t-SNE algorithm?
      Alternatives include UMAP (Uniform Manifold Approximation and Projection), PCA (Principal Component Analysis), and Isomap, depending on the specific needs for speed, interpretability, or scalability.

  11. How does the t-SNE algorithm fit into unsupervised machine learning?
      The t-SNE algorithm is a classic example of an unsupervised machine learning method. It doesn’t rely on labeled data but instead analyzes similarities between data points to reduce dimensionality and visualize patterns. It’s particularly useful for understanding structure in high-dimensional datasets and validating clusters or embeddings generated by other algorithms.

To learn more about unsupervised machine learning, including clustering, dimensionality reduction, and real-world applications, check out our complete guide:

👉 Unsupervised Machine Learning: A Comprehensive Guide for Beginners and Experts