Unsupervised Learning: Clustering and Dimensionality Reduction

Imagine you're handed a hard drive containing five years of customer transactions, ten thousand product SKUs, a million sensor readings from a factory floor, and a database of patient health records, and nobody has labeled any of it. No categories. No classes. No "this customer is a loyalist" or "this reading is an anomaly." Just raw numbers staring back at you, waiting for someone to make sense of them.
This is the reality of most machine learning work in the wild. In textbooks and tutorials, we love labeled datasets. They're clean, pre-packaged, and come with a tidy "correct answer" column we can train toward. But in practice, labeling data at scale is expensive, slow, and sometimes impossible: you can't hire a team of experts to categorize a million sensor readings by hand. Even when you have labels for a fraction of your data, the unlabeled portion can be larger by orders of magnitude. And sometimes you genuinely don't know what categories exist yet; you need the data to tell you what patterns matter before you can even decide what to label.
That's not a limitation. That's an invitation. Unsupervised learning is the discipline of finding structure in data when you have no ground truth to guide you. Done right, it turns chaos into clarity. It surfaces customer segments that your marketing team never thought to look for. It compresses a hundred-feature dataset down to five meaningful dimensions that actually explain customer behavior. It flags the sensor readings that look nothing like anything you've seen before, the ones that might be predicting a machine failure three days before it happens. We are not just organizing data when we do unsupervised learning; we are discovering the hidden geometry of real-world phenomena, and that is one of the most powerful things you can do with a dataset.
In this article, we'll build from first principles through a complete customer segmentation workflow. By the time you finish, you'll understand K-Means, DBSCAN, hierarchical clustering, PCA, t-SNE, and UMAP: not just how to call them in scikit-learn, but why they work, when to reach for each one, and how to evaluate results when there's no "correct answer" to check against.
Table of Contents
- Why Unsupervised Learning Matters
- When Unsupervised Learning Shines
- K-Means Clustering: The Foundation
- The Elbow Method
- K-Means Intuition
- Silhouette Score
- DBSCAN: Clustering Without Specifying K
- Hierarchical Clustering and Dendrograms
- Dimensionality Reduction: PCA Essentials
- PCA: Compression as Understanding
- t-SNE: Beautiful Non-Linear Visualization
- UMAP: Fast, Structure-Preserving Reduction
- Complete Workflow: Customer Segmentation Case Study
- Evaluation Metrics: Beyond the Visuals
- Choosing the Right Algorithm
- Key Takeaways
- Wrapping Up
Why Unsupervised Learning Matters
Without labels, supervised learning is impossible. But unsupervised learning thrives in ambiguity. You might want to:
- Segment customers into behavior groups for targeted marketing
- Detect anomalies in production systems or financial fraud
- Compress images or text while keeping meaningful information
- Explore structure when you don't know what you're looking for
- Reduce computational load by keeping only important features
The challenge? There's no ground truth to validate against. A clustering solution isn't "right" or "wrong"; it's useful or not useful for your business goal. That distinction forces us to think carefully about evaluation metrics and interpretability.
When Unsupervised Learning Shines
Supervised learning gets most of the glory, but unsupervised methods earn their keep in situations where supervised simply can't operate. The first and most obvious case is when you have no labels at all, but that barely scratches the surface of when unsupervised approaches are the right tool.
Consider anomaly detection. You want to flag fraudulent credit card transactions, but you have very few confirmed fraud examples compared to millions of legitimate ones. A supervised classifier trained on that imbalance will learn to predict "not fraud" almost every time and still achieve 99.9% accuracy. Unsupervised methods sidestep this entirely: they learn the geometry of normal transactions, then flag anything that falls outside that normal space. No labels required, and no class imbalance problem to wrestle with.
Dimensionality reduction shines in exploratory data analysis. When you're handed a dataset with fifty features, you genuinely don't know which ones matter or how they relate to each other. Running PCA in the first five minutes of exploration will tell you whether those fifty features are actually encoding two or three underlying patterns, a discovery that reshapes every subsequent analysis decision you make.
Clustering is indispensable for discovery tasks where the categories themselves are unknown. In genomics, researchers cluster gene expression profiles to find groups of genes that behave similarly, leading to new hypotheses about biological function. In marketing, clustering purchase behavior reveals customer archetypes that the business can then build strategies around. In manufacturing, clustering sensor readings during normal operation establishes a baseline; deviations from cluster membership become early warning signals. The insight in all these cases doesn't come from a labeled training set. It comes from the data organizing itself.
Finally, unsupervised methods are increasingly used as preprocessing for supervised learning. Reducing a hundred features to twenty principal components before training a classifier often improves generalization, reduces training time, and makes the resulting model easier to interpret. The unsupervised step and the supervised step work together, each making the other more effective.
K-Means Clustering: The Foundation
K-Means is the Hello World of clustering. It's simple, fast, and teaches you the core concepts you'll apply everywhere.
The algorithm: Randomly initialize K cluster centers. Assign each point to the nearest center. Move centers to the mean of their assigned points. Repeat until convergence. Done.
Before we touch code, it's worth internalizing why this simple loop works. The assignment step minimizes within-cluster variance given fixed centers. The update step finds the optimal centers given fixed assignments. By alternating between these two optimizations, the algorithm descends toward a local minimum of total within-cluster variance. It won't always find the global minimum; that's why we run it multiple times with different random initializations (n_init=10) and keep the best result. But in practice, this simple alternation converges reliably to useful solutions on most real-world data.
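The assign/update loop is small enough to sketch from scratch. Here's an illustrative NumPy implementation on synthetic blobs; it's a teaching sketch, not a replacement for scikit-learn's optimized version, which adds smarter (k-means++) initialization and multiple restarts:

```python
import numpy as np

rng = np.random.default_rng(0)
# Three well-separated synthetic blobs
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2))
               for c in ([0, 0], [5, 5], [0, 5])])

def kmeans(X, k, max_iter=100):
    # Initialize centers at k randomly chosen data points
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center moves to the mean of its assigned points
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break  # converged: assignments can no longer change
        centers = new_centers
    return centers, labels

centers, labels = kmeans(X, k=3)
print(np.round(centers, 1))
```

Each pass through the loop can only decrease (or leave unchanged) the total within-cluster variance, which is why the alternation must converge.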
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
import numpy as np
import matplotlib.pyplot as plt

# Generate sample data
X, y_true = make_blobs(n_samples=300, centers=4,
                       cluster_std=0.6, random_state=42)

# Fit K-Means
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
clusters = kmeans.fit_predict(X)

# Plot results
plt.figure(figsize=(10, 6))
plt.scatter(X[:, 0], X[:, 1], c=clusters, cmap='viridis', alpha=0.6)
plt.scatter(kmeans.cluster_centers_[:, 0],
            kmeans.cluster_centers_[:, 1],
            c='red', marker='X', s=200, edgecolors='black', linewidth=2)
plt.title('K-Means Clustering Results')
plt.show()

The red X's are cluster centers. Each point belongs to the nearest center. Simple and effective, but how do you know if 4 clusters is actually right? This is where the real craft begins, because choosing K is less a mathematical problem than a business judgment call informed by statistical evidence.
The Elbow Method
K-Means optimizes inertia, the sum of squared distances from each point to its assigned center. Inertia always decreases as you add more clusters (eventually, K=N gives zero inertia). The trick is finding where adding more clusters gives diminishing returns.
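Inertia is exactly what it says. A quick sanity check, hand-computing the sum of squared distances and comparing it against scikit-learn's `inertia_` attribute on the same synthetic blobs:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=42)
km = KMeans(n_clusters=4, random_state=42, n_init=10).fit(X)

# Squared distance from each point to the center it was assigned to
diffs = X - km.cluster_centers_[km.labels_]
manual_inertia = (diffs ** 2).sum()
print(np.isclose(manual_inertia, km.inertia_))
```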
The elbow method works because of a simple principle: if your data genuinely has K natural clusters, going from K-1 to K clusters produces a dramatic drop in inertia as you properly separate the distinct groups. But going from K to K+1 only makes a small improvement, because you're now splitting a genuine cluster rather than separating two distinct groups. That "elbow" in the inertia curve marks the natural number of clusters in your data, though real data often produces a gradual curve rather than a clean elbow, which is why we pair this visual method with the quantitative silhouette score.
from sklearn.metrics import silhouette_score

inertias = []
silhouette_scores = []
K_range = range(2, 11)

for k in K_range:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(X)
    inertias.append(km.inertia_)
    silhouette_scores.append(silhouette_score(X, km.labels_))

# Plot the elbow
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
ax1.plot(K_range, inertias, 'bo-', linewidth=2, markersize=8)
ax1.set_xlabel('Number of Clusters (K)')
ax1.set_ylabel('Inertia')
ax1.set_title('Elbow Method')
ax1.grid(True, alpha=0.3)
ax2.plot(K_range, silhouette_scores, 'go-', linewidth=2, markersize=8)
ax2.set_xlabel('Number of Clusters (K)')
ax2.set_ylabel('Silhouette Score')
ax2.set_title('Silhouette Scores by K')
ax2.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

You're looking for the elbow: the point where inertia stops dropping dramatically. This is where domain knowledge meets statistics. A business might prefer 4 segments even if 5 clusters has slightly better inertia, because 4 is easier to act on. The mathematical optimum and the operationally useful answer are not always the same thing, and the best data scientists know when to prioritize the latter.
K-Means Intuition
To really understand K-Means, to trust it rather than just run it, you need to internalize what it assumes about the shape of your data. K-Means measures similarity using Euclidean distance, which means it fundamentally assumes that clusters are roughly spherical and similarly sized. If your true clusters are elongated crescents, concentric rings, or dramatically different in scale, K-Means will carve the space incorrectly no matter how carefully you choose K.
This is not a flaw; it's a design choice that enables speed. K-Means runs in O(n × K × iterations) time, making it practical even on datasets with millions of points. The spherical cluster assumption is the price you pay for that scalability, and it's a reasonable price for many real-world problems. Customer segments based on purchasing behavior tend to form relatively compact clouds in feature space. User cohorts defined by engagement metrics cluster in roughly convex regions. The assumption holds well enough that K-Means remains the first algorithm most practitioners reach for.
The practical implication is that you should always visualize your K-Means results rather than trusting the metrics alone. Reduce to two dimensions with PCA, plot the assigned clusters, and look for obvious splits across what should be a single natural group, or obvious merges across what look like two distinct populations. Your eyes will catch failures that silhouette scores miss. When K-Means is working well, the clusters in your PCA plot will look roughly circular and clearly separated. When it's struggling, you'll see elongated assignments and boundary regions where points look like they belong to the wrong group.
One more practical note: always scale your features before running K-Means. Because the algorithm uses Euclidean distance, a feature measured in thousands of dollars will dominate the clustering over a feature measured in years of tenure, even if the latter is substantively more important. StandardScaler removes this problem by putting every feature on the same scale before the distance calculations begin.
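The scale problem is easy to demonstrate with two hypothetical features, annual spend in dollars and tenure in years (names and numbers invented for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Hypothetical customers: spend is in the thousands, tenure in single digits
spend = rng.normal(5000, 2000, size=200)
tenure = rng.normal(4, 2, size=200)
X = np.column_stack([spend, tenure])

# Raw Euclidean distance between two customers is dominated by the spend axis
a, b = X[0], X[1]
raw_dist = np.linalg.norm(a - b)
spend_only = abs(a[0] - b[0])
print(f"raw distance: {raw_dist:.1f}, spend contribution alone: {spend_only:.1f}")

# After scaling, both features contribute on comparable terms
X_scaled = StandardScaler().fit_transform(X)
print(f"scaled distance: {np.linalg.norm(X_scaled[0] - X_scaled[1]):.2f}")
```

Before scaling, tenure barely moves the distance at all; after scaling, a one-standard-deviation difference in tenure counts exactly as much as one in spend.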
Silhouette Score
The silhouette score measures cluster cohesion and separation. For each point, it calculates how similar it is to points in its own cluster versus points in other clusters. Scores range from -1 (bad assignment) to +1 (excellent).
The formula is elegant: for each point, compute a (the average distance to other points in the same cluster) and b (the average distance to points in the nearest different cluster). The silhouette value is (b - a) / max(a, b). A point with a high silhouette score is both close to its cluster-mates and far from the nearest alternative cluster; it is unambiguously assigned. A point with a score near zero sits at a cluster boundary. A negative score means the point is actually closer to a neighboring cluster than to its own, which is a sign that either K is wrong or that point is genuinely ambiguous.
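To make the formula concrete, here's the silhouette value for a single point computed by hand and checked against scikit-learn's silhouette_samples (same synthetic blobs as before):

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=42)
labels = KMeans(n_clusters=4, random_state=42, n_init=10).fit_predict(X)

# Hand-compute the silhouette value for one point (index 0)
i = 0
same_cluster = X[labels == labels[i]]
# a: mean distance to the *other* points in the same cluster
d_same = cdist(X[i:i+1], same_cluster)[0]
a = d_same[d_same > 0].mean()  # drop the zero distance to itself
# b: mean distance to the points of the nearest other cluster
b = min(cdist(X[i:i+1], X[labels == c])[0].mean()
        for c in set(labels) if c != labels[i])
s_manual = (b - a) / max(a, b)

s_sklearn = silhouette_samples(X, labels)[i]
print(np.isclose(s_manual, s_sklearn))
```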
from sklearn.metrics import silhouette_samples
import matplotlib.cm as cm

# For K=4
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
cluster_labels = kmeans.fit_predict(X)
silhouette_vals = silhouette_samples(X, cluster_labels)

fig, ax = plt.subplots(figsize=(10, 6))
y_lower = 10
for i in range(4):
    cluster_silhouette_vals = silhouette_vals[cluster_labels == i]
    cluster_silhouette_vals.sort()
    size_cluster_i = cluster_silhouette_vals.shape[0]
    y_upper = y_lower + size_cluster_i
    color = cm.nipy_spectral(float(i) / 4)
    ax.fill_betweenx(np.arange(y_lower, y_upper),
                     0, cluster_silhouette_vals,
                     facecolor=color, edgecolor=color, alpha=0.7)
    ax.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))
    y_lower = y_upper + 10
ax.set_title('Silhouette Plot for K=4')
ax.set_xlabel('Silhouette Coefficient')
ax.set_ylabel('Cluster Label')
plt.show()

A silhouette plot shows you which points are well-assigned and which are ambiguous. Negative values mean a point might belong to a different cluster; those are boundary cases worth investigating. In a well-clustered dataset, all bars in the silhouette plot will extend well past the mean silhouette line, and none will dip below zero. If certain clusters have many negative values, that's a strong signal to reconsider K or try a different algorithm entirely.
DBSCAN: Clustering Without Specifying K
K-Means forces you to specify the number of clusters upfront. DBSCAN (Density-Based Spatial Clustering of Applications with Noise) learns the number of clusters from data itself.
The idea: Points that are close together in dense regions should be in the same cluster. Points in sparse regions are noise.
DBSCAN defines clusters as contiguous high-density regions separated by low-density gaps. A point is a "core point" if at least min_samples points (itself included, in scikit-learn's convention) fall within distance eps of it. Core points and any points reachable from them (even through a chain of other core points) form a single cluster. Points that are not reachable from any core point are labeled as noise, receiving a label of -1. This density-based definition handles arbitrary cluster shapes naturally: if you have two interlocking crescents, DBSCAN will find them. K-Means would split each crescent down the middle.
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Scale features (DBSCAN is distance-sensitive)
X_scaled = StandardScaler().fit_transform(X)

# Fit DBSCAN
dbscan = DBSCAN(eps=0.4, min_samples=5)
db_labels = dbscan.fit_predict(X_scaled)

# Visualize
plt.figure(figsize=(10, 6))
unique_labels = set(db_labels)
colors = plt.cm.Spectral(np.linspace(0, 1, len(unique_labels)))
for label, color in zip(unique_labels, colors):
    if label == -1:
        # Noise points in black
        color = 'black'
        marker = 'x'
    else:
        marker = 'o'
    class_member_mask = (db_labels == label)
    xy = X_scaled[class_member_mask]
    plt.scatter(xy[:, 0], xy[:, 1], c=[color], marker=marker,
                s=100, alpha=0.6, edgecolors='k')
plt.title('DBSCAN Clustering Results')
plt.show()

print(f"Number of clusters: {len(set(db_labels)) - (1 if -1 in db_labels else 0)}")
print(f"Number of noise points: {list(db_labels).count(-1)}")

DBSCAN found its own clusters and identified noise points (marked with X). No K to choose, no assumptions about cluster shape. It handles arbitrary shapes beautifully; K-Means would struggle with non-spherical clusters.
The tradeoff? You need to tune eps (the radius within which to search for neighbors) and min_samples (how many neighbors make a core point). A practical starting point for eps is to compute the k-nearest-neighbor distances for each point (using k = min_samples), sort them, and look for the "knee" in that curve, the distance at which the sorted distances start increasing steeply. That knee is a natural eps choice. For min_samples, a good default is 2 × number of features, with a minimum of 5. DBSCAN also struggles when clusters have dramatically different densities, because a single eps value can't simultaneously capture a tight cluster and a diffuse one.
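The k-distance heuristic looks like this in practice. A sketch using NearestNeighbors on the same kind of synthetic data; note that kneighbors counts each point as its own nearest neighbor, which is close to (though not identical to) how DBSCAN counts min_samples:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

min_samples = 5
# Distance from each point to its (min_samples)-th nearest neighbor
nn = NearestNeighbors(n_neighbors=min_samples).fit(X_scaled)
distances, _ = nn.kneighbors(X_scaled)
k_distances = np.sort(distances[:, -1])  # sort ascending across all points

plt.plot(k_distances)
plt.xlabel('Points sorted by k-distance')
plt.ylabel(f'Distance to neighbor #{min_samples}')
plt.title('k-distance plot: choose eps near the knee')
plt.show()
```

Points inside dense clusters have small k-distances; the curve stays flat across them and then shoots upward when it reaches the sparse outliers. An eps just below that bend separates dense regions from noise.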
Hierarchical Clustering and Dendrograms
Sometimes you don't want a single partition. You want to see the entire hierarchy, how clusters split and merge at different distance thresholds.
The key insight of hierarchical clustering is that it doesn't commit to a single number of clusters. Instead, it builds a tree of all possible clusterings, from every point in its own cluster all the way to all points in one cluster, and lets you choose where to cut the tree afterward. This makes it enormously flexible for exploratory analysis: you can show a dendrogram to a business stakeholder and ask "does this split make sense to you?" rather than defending a fixed K you chose algorithmically.
from scipy.cluster.hierarchy import dendrogram, linkage

# Compute linkage matrix
linkage_matrix = linkage(X_scaled, method='ward')

# Plot dendrogram
plt.figure(figsize=(12, 6))
dendrogram(linkage_matrix, leaf_rotation=90)
plt.xlabel('Sample Index')
plt.ylabel('Distance')
plt.title('Hierarchical Clustering Dendrogram')
plt.tight_layout()
plt.show()

A dendrogram is a tree diagram showing how clusters merge. The height of each merge tells you the distance at which two clusters combine. You can "cut" the tree at any height to get a specific number of clusters, or let the data structure guide you. Look for the level where there are long vertical lines before a merge: that indicates a large distance jump, meaning the two groups being merged are genuinely quite different from each other. A cut just before that jump gives you the natural clustering.
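Cutting the tree programmatically is a one-liner with SciPy's fcluster. A sketch under the same synthetic-blob setup used earlier; the 10.0 distance threshold is purely illustrative, so read the real value off your own dendrogram:

```python
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=42)
Z = linkage(StandardScaler().fit_transform(X), method='ward')

# Cut to get exactly 4 flat clusters...
labels_4 = fcluster(Z, t=4, criterion='maxclust')
# ...or cut at a distance threshold read off the dendrogram
labels_h = fcluster(Z, t=10.0, criterion='distance')

print(len(set(labels_4)), len(set(labels_h)))
```

The same linkage matrix serves both cuts, which is exactly the "four segments for strategy, twelve for tactics" flexibility described later: fit once, cut as many ways as you need.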
Hierarchical clustering is slower (O(n²) or worse) than K-Means but offers interpretability. You see the full relationship structure, not just the final partition. Ward linkage (which minimizes total within-cluster variance at each merge step) tends to produce the most visually clean and intuitively sensible dendrograms for most datasets, and it's a reliable default unless you have specific reasons to prefer other linkage methods.
Dimensionality Reduction: PCA Essentials
You've got 50 features but suspect only a few matter. Dimensionality reduction finds the most important axes in your data.
PCA (Principal Component Analysis) works by:
- Finding the direction of maximum variance (PC1)
- Finding the next perpendicular direction of maximum variance (PC2)
- Repeating until you've captured enough variance
PCA: Compression as Understanding
Before we run the code, it's worth slowing down to appreciate what PCA is actually doing, because the intuition makes everything else click. Every dataset is a cloud of points living in some high-dimensional space, one dimension per feature. PCA asks a deceptively simple question: what is the shape of that cloud? If the cloud is a long thin ellipse, the long axis (PC1) captures most of the variance. PC2 captures the next most variance orthogonal to PC1, and so on.
The revelation is that most real-world datasets are not randomly distributed across all their dimensions. Customer behavior, gene expression, image pixels, financial returns: they all have structure, correlations, redundancies. Ten features might really be encoding two or three underlying patterns, with the rest being noise or linear combinations of the core signals. PCA finds those core signals and discards the noise. When you reduce 50 features to 5 principal components that explain 90% of the variance, you haven't thrown away most of the information; you've kept the directions along which the data genuinely varies and discarded the dimensions dominated by noise. That's a profound practical benefit that goes beyond mere compression.
There's also an interpretability payoff. Principal components often correspond to meaningful real-world factors. In a customer dataset, PC1 might separate high-value frequent buyers from low-value occasional ones. PC2 might separate customers who buy across many categories from those who specialize. You can examine the feature loadings, how much each original feature contributes to each principal component, and often name the component based on which features dominate it. This turns abstract dimensionality reduction into concrete business insight.
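Here's a sketch of reading loadings, on simulated data where two invented latent signals (a "value" signal and a "breadth" signal) drive five correlated features. The loadings table recovers that two-factor structure:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 500
value = rng.exponential(100, n)                 # latent "value" signal
breadth = rng.integers(1, 20, n).astype(float)  # latent "breadth" signal

# Five observed features driven by the two latent signals (names invented)
df = pd.DataFrame({
    'monetary_value': value,
    'avg_order_value': 0.5 * value + rng.normal(0, 5, n),
    'frequency': 0.05 * value + rng.normal(0, 1, n),
    'product_diversity': breadth,
    'category_span': breadth + rng.normal(0, 1, n),
})

pca = PCA(n_components=2).fit(StandardScaler().fit_transform(df))
# Loadings: contribution of each original feature to each component
loadings = pd.DataFrame(pca.components_.T, index=df.columns,
                        columns=['PC1', 'PC2'])
print(loadings.round(2))
```

Reading down each column, the value-driven features load together on one component and the breadth-driven features on the other. The sign of a component is arbitrary, so interpret relative magnitudes of loadings, not their signs.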
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Normalize data (PCA is sensitive to scale)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Fit PCA
pca = PCA()
X_pca = pca.fit_transform(X_scaled)

# Explained variance
explained_var = pca.explained_variance_ratio_
cumsum_var = np.cumsum(explained_var)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

# Scree plot
ax1.plot(range(1, len(explained_var) + 1), explained_var, 'bo-', linewidth=2)
ax1.set_xlabel('Principal Component')
ax1.set_ylabel('Explained Variance Ratio')
ax1.set_title('Scree Plot')
ax1.grid(True, alpha=0.3)

# Cumulative variance
ax2.plot(range(1, len(cumsum_var) + 1), cumsum_var, 'go-', linewidth=2)
ax2.axhline(y=0.95, color='r', linestyle='--', label='95% Threshold')
ax2.set_xlabel('Number of Components')
ax2.set_ylabel('Cumulative Explained Variance')
ax2.set_title('Cumulative Explained Variance')
ax2.legend()
ax2.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# How many components for 95% variance?
n_components_95 = np.argmax(cumsum_var >= 0.95) + 1
print(f"Components needed for 95% variance: {n_components_95}")

The scree plot shows how much variance each component explains. The cumulative plot tells you when you've captured enough information. If 2 components explain 95% of variance, you can safely drop the rest and plot your data in 2D. The 95% threshold is a reasonable rule of thumb for most applications, though if you're using PCA as preprocessing for a downstream classifier, you may find that keeping 99% of variance gives better model performance; always experiment with the threshold rather than treating it as sacred.
t-SNE: Beautiful Non-Linear Visualization
PCA is linear: it finds straight-line combinations of features. For complex, non-linear data, t-SNE (t-Distributed Stochastic Neighbor Embedding) often produces more interpretable visualizations.
The core idea behind t-SNE is to preserve neighborhood relationships rather than global variance. For every point, t-SNE computes a probability distribution over its neighbors in high-dimensional space (using a Gaussian kernel), then tries to reproduce those same neighborhood probabilities in two dimensions. Points that are close in high-dimensional space will be pulled together in the 2D layout. Points that are far apart will be pushed apart. The result is a visualization that faithfully represents local structure, which clusters are tight, which are loose, which points sit on the boundary between groups.
from sklearn.manifold import TSNE

# t-SNE is computationally expensive; use a subset for exploration
# (this parameter is named n_iter in scikit-learn versions before 1.5)
tsne = TSNE(n_components=2, random_state=42, perplexity=30, max_iter=1000)
X_tsne = tsne.fit_transform(X_scaled)

plt.figure(figsize=(10, 6))
scatter = plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=clusters,
                      cmap='viridis', alpha=0.6, s=50)
plt.colorbar(scatter, label='Cluster')
plt.title('t-SNE Visualization of Clusters')
plt.xlabel('t-SNE 1')
plt.ylabel('t-SNE 2')
plt.show()

t-SNE preserves local structure beautifully. Points that are similar in high-dimensional space end up close together in 2D. But be careful: t-SNE distorts global distances and is non-deterministic. Use it for exploration and visualization, not for downstream analysis. Two clusters that appear close in a t-SNE plot may actually be quite different in the original feature space; the algorithm intentionally compresses distant regions to make local structure legible. Always use t-SNE alongside quantitative metrics, never instead of them.
UMAP: Fast, Structure-Preserving Reduction
UMAP (Uniform Manifold Approximation and Projection) is newer but increasingly popular. It's faster than t-SNE, preserves more global structure, and works in any number of dimensions.
UMAP is built on solid mathematical foundations from topology and Riemannian geometry, but you don't need to understand the theory to use it effectively. The practical story is this: UMAP constructs a fuzzy topological representation of the high-dimensional data, then optimizes a low-dimensional layout to match that representation as closely as possible. The result looks similar to t-SNE but with better preservation of the relative distances between clusters, faster runtime on large datasets, and the ability to reduce to more than two dimensions (useful when you want 5D for downstream analysis, not just 2D for visualization).
# pip install umap-learn
from umap import UMAP

reducer = UMAP(n_components=2, random_state=42, n_neighbors=15, min_dist=0.1)
X_umap = reducer.fit_transform(X_scaled)

plt.figure(figsize=(10, 6))
scatter = plt.scatter(X_umap[:, 0], X_umap[:, 1], c=clusters,
                      cmap='viridis', alpha=0.6, s=50)
plt.colorbar(scatter, label='Cluster')
plt.title('UMAP Visualization of Clusters')
plt.xlabel('UMAP 1')
plt.ylabel('UMAP 2')
plt.show()

UMAP is mathematically similar to t-SNE but scales better and preserves more structure. For large datasets, or when you need the visualization to inform further analysis, UMAP is often preferable. The n_neighbors parameter controls the balance between local and global structure: small values focus on local neighborhoods and produce tighter, more separated clusters; large values zoom out and preserve broader structural relationships between groups. Start with 15 and adjust based on what you see.
Complete Workflow: Customer Segmentation Case Study
Let's tie it together. You've got customer transaction data: purchase amounts, frequency, recency, product categories. The goal: segment customers into actionable groups.
This is the most important section of the article, because it shows you how the individual tools combine into a coherent analytical workflow. In practice, you never just run K-Means and call it done. You preprocess carefully, choose K empirically, validate with multiple metrics, reduce to 2D for communication, and then, crucially, actually look at what the segments mean in business terms. A cluster with high recency, high frequency, and high monetary value is your champion segment. Nurture them. A cluster with high recency but low frequency might be new customers you can convert to loyalists with the right campaign. The math finds the segments; the domain knowledge names them and determines what to do next.
import pandas as pd

# Simulate customer data
np.random.seed(42)
n_customers = 500
customer_data = pd.DataFrame({
    'recency_days': np.random.randint(1, 365, n_customers),
    'frequency': np.random.exponential(scale=5, size=n_customers),
    'monetary_value': np.random.exponential(scale=100, size=n_customers),
    'product_diversity': np.random.randint(1, 20, n_customers),
    'avg_order_value': np.random.exponential(scale=50, size=n_customers)
})

# Normalize
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_customer = scaler.fit_transform(customer_data)

# Find optimal K using silhouette
from sklearn.metrics import silhouette_score
silhouette_scores = []
for k in range(2, 8):
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(X_customer)
    score = silhouette_score(X_customer, km.labels_)
    silhouette_scores.append(score)
    print(f"K={k}: Silhouette = {score:.3f}")

optimal_k = np.argmax(silhouette_scores) + 2
print(f"\nOptimal K: {optimal_k}")

# Fit final model
kmeans_final = KMeans(n_clusters=optimal_k, random_state=42, n_init=10)
customer_data['segment'] = kmeans_final.fit_predict(X_customer)

# Reduce to 2D for visualization
pca = PCA(n_components=2)
X_pca_2d = pca.fit_transform(X_customer)

plt.figure(figsize=(10, 6))
scatter = plt.scatter(X_pca_2d[:, 0], X_pca_2d[:, 1],
                      c=customer_data['segment'], cmap='Set3',
                      alpha=0.6, s=50)
plt.colorbar(scatter, label='Segment')
plt.xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.1%} var)')
plt.ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.1%} var)')
plt.title('Customer Segments (PCA Visualization)')
plt.show()

# Analyze segments
for seg in range(optimal_k):
    segment_data = customer_data[customer_data['segment'] == seg]
    print(f"\nSegment {seg} (n={len(segment_data)}):")
    print(segment_data.describe().loc[['mean', '50%', 'std']])

You now have customer segments with interpretable characteristics. Segment 0 might be high-value, frequent buyers (champions). Segment 1 could be one-time purchasers (need nurturing). Business teams can build targeted strategies for each. Notice that we used PCA purely for visualization here; the clustering itself was done in the full five-dimensional feature space, where all the information lives. PCA's job in this workflow is communication, not computation.
Evaluation Metrics: Beyond the Visuals
Silhouette score is great, but here's the fuller picture:
from sklearn.metrics import calinski_harabasz_score, davies_bouldin_score

# For the final segmentation
labels = customer_data['segment']

# Silhouette: -1 to 1, higher is better
sil = silhouette_score(X_customer, labels)
# Calinski-Harabasz: higher is better (ratio of between-cluster to within-cluster variance)
ch = calinski_harabasz_score(X_customer, labels)
# Davies-Bouldin: lower is better (average similarity of each cluster with its most similar cluster)
db = davies_bouldin_score(X_customer, labels)

print(f"Silhouette Score: {sil:.3f}")
print(f"Calinski-Harabasz Index: {ch:.1f}")
print(f"Davies-Bouldin Index: {db:.3f}")

No single metric is truth. Use multiple angles: silhouette for cohesion, Calinski-Harabasz for balance, Davies-Bouldin for separation. Then talk to domain experts about whether the clusters make business sense. A clustering with slightly worse silhouette scores but four interpretable, actionable segments is more valuable than a mathematically optimal solution that produces groups no one can explain or act on. Evaluation is always a conversation between statistics and domain knowledge.
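One way to put "multiple angles" into practice is to scan K and watch whether the three metrics agree. On clean synthetic blobs they typically all point to the same K; on real data, disagreement between them is itself useful information:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (silhouette_score, calinski_harabasz_score,
                             davies_bouldin_score)

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=42)

# Scan K and record all three metrics for each candidate clustering
results = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X)
    results[k] = (silhouette_score(X, labels),
                  calinski_harabasz_score(X, labels),
                  davies_bouldin_score(X, labels))
    print(f"K={k}  silhouette={results[k][0]:.3f}  "
          f"CH={results[k][1]:.0f}  DB={results[k][2]:.3f}")
```

Remember the directions: silhouette and Calinski-Harabasz should peak at a good K, while Davies-Bouldin should dip.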
Choosing the Right Algorithm
One of the most practical questions in unsupervised learning is deceptively simple: given a new dataset, where do you start? The answer depends on what you know about your data and what you're trying to accomplish.
Start with K-Means when you have a rough sense of how many groups to expect, when your data is reasonably high-dimensional and roughly globular in its cluster shapes, and when you need a fast answer on a large dataset. K-Means scales to millions of points and runs in seconds. Its limitations, the need to specify K, the assumption of spherical clusters, are manageable with the elbow method and silhouette analysis. For most business analytics problems, K-Means is the right first move.
Reach for DBSCAN when you suspect your clusters have irregular shapes, when noise and outliers are present and should be explicitly labeled rather than force-assigned to the nearest centroid, or when you genuinely don't want to specify K in advance. DBSCAN is also the natural choice for anomaly detection tasks where you want to define "normal" as the dense regions and "anomalous" as the sparse points that fall outside. The tuning is more involved, but the outputs are richer.
Use hierarchical clustering when stakeholder communication matters as much as the final partition. The dendrogram is a powerful explainability tool: you can show it to a domain expert and have a conversation about what the splits mean. Hierarchical approaches are also valuable when you want to explore clusterings at multiple granularities: perhaps you want a four-segment view for strategic planning and a twelve-segment view for tactical execution, and hierarchical clustering gives you both from a single model.
For dimensionality reduction, PCA is almost always the right first step: run it on any new dataset to understand how much variance the features actually encode and whether the dimensionality is as high as it appears. Use t-SNE or UMAP for visualization and exploration, especially when you want to show cluster separation to stakeholders or check whether your clustering algorithm is finding visually coherent groups. Prefer UMAP over t-SNE when your dataset exceeds a few thousand points or when you need more than two output dimensions.
Key Takeaways
K-Means is fast and intuitive but assumes spherical clusters and requires specifying K. Use the elbow method and silhouette scores to choose K.
DBSCAN finds clusters organically and handles noise, but requires tuning eps and min_samples. Great for complex shapes.
Hierarchical clustering shows the full structure. Slower but interpretable, especially when you don't know the right granularity upfront.
PCA linearly compresses data while preserving variance. Fast, interpretable, and works well when relationships are roughly linear.
t-SNE and UMAP capture non-linear structure beautifully for visualization. Don't use them for downstream analysis, just for exploration.
Real-world work combines these. Segment with K-Means, reduce to 2D with PCA for plotting, validate with silhouette and domain knowledge. The goal isn't the perfect cluster; it's actionable insights.
Wrapping Up
The absence of labels is not a problem to overcome; it's a feature of the real world that unsupervised learning is designed to embrace. Every dataset you encounter will have more unlabeled data than labeled data. Learning to find structure in that unlabeled space is one of the highest-leverage skills in applied machine learning.
We covered a lot of ground here. K-Means gives you fast, interpretable clusters with well-understood tradeoffs. DBSCAN handles the messy realities of non-spherical clusters and noisy data. Hierarchical clustering offers the full picture through dendrograms. PCA compresses while preserving meaning, and the scree plot teaches you how much dimensionality your data really has. t-SNE and UMAP give you beautiful 2D windows into high-dimensional structure. And the customer segmentation workflow showed you how these tools work together in a coherent pipeline.
The meta-lesson is about evaluation. Supervised learning gives you accuracy, F1, AUC: clean numbers that tell you unambiguously how well you're doing. Unsupervised learning gives you silhouette scores and dendrograms and PCA plots, and asks you to combine those statistical signals with domain knowledge and business judgment. That ambiguity is not a weakness. It's an invitation to think harder about what you're actually trying to discover, and whether your algorithm is helping you find it. The best practitioners in this space are not the ones who know the most algorithms; they're the ones who know how to have the right conversation between the mathematics and the meaning.