A Deep Dive into Clustering for Customer Segmentation



This content originally appeared on DEV Community and was authored by Gruhesh Sri Sai Karthik Kurra

Introduction

Ever wonder how companies like Netflix recommend movies you’ll love, or how Amazon suggests products you might need? A key technology behind this magic is clustering, a type of unsupervised machine learning.

Think of it like organizing your music library. Without knowing the genres, you could group songs by tempo, instruments, and energy level. Clustering does the same for data, finding hidden patterns and grouping similar items together without any pre-existing labels.

In this post, we’ll take a deep dive into clustering by building a customer segmentation model from scratch. We’ll generate our own dataset and apply four popular clustering algorithms to see which one works best.

What We’ll Cover

  • Generating Synthetic Customer Data: We’ll create a realistic dataset of customers based on age and income.
  • Finding the Optimal Number of Clusters: Using the Elbow Method and Silhouette Scores to decide how many customer segments to create.
  • Applying Four Clustering Algorithms: We’ll implement and compare:
    • K-Means
    • Hierarchical Clustering
    • DBSCAN
    • Gaussian Mixture Models (GMM)
  • Visualizing and Comparing Results: We’ll create plots to see how each algorithm performed.

Step 1: Generating the Data

First, we need some data. To keep things focused, we’ll generate a synthetic dataset of 300 customers with four distinct segments. This allows us to have “ground truth” labels to evaluate our models against later.

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

def generate_customer_data():
    np.random.seed(42)
    # Define 4 customer segments (age_center, income_center)
    cluster_centers = [(25, 30000), (45, 80000), (35, 55000), (55, 100000)]
    cluster_stds = [5, 8, 6, 10]
    all_data = []
    all_labels = []

    for i, (age_center, income_center) in enumerate(cluster_centers):
        n_samples = 75
        ages = np.random.normal(age_center, cluster_stds[i], n_samples)
        incomes = np.random.normal(income_center, cluster_stds[i] * 1000, n_samples)
        # Combine into a single feature matrix
        cluster_data = np.column_stack([ages, incomes])
        all_data.append(cluster_data)
        all_labels.append(np.full(n_samples, i))  # keep the ground-truth segment id

    X = np.vstack(all_data)
    y_true = np.concatenate(all_labels)
    return X, y_true

X_raw, y_true = generate_customer_data()

Step 2: The Importance of Scaling

Our features, Age (e.g., 25, 45) and Income (e.g., 30000, 80000), are on vastly different scales. Most clustering algorithms are distance-based, so the Income feature would completely dominate the Age feature.

To fix this, we use StandardScaler from scikit-learn to give both features a mean of 0 and a standard deviation of 1.

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_raw)
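
As a quick sanity check (not part of the original post), we can confirm that each scaled feature now has a mean of roughly 0 and a standard deviation of roughly 1:

# Each column of X_scaled should now have mean ~0 and standard deviation ~1
print("Means:", X_scaled.mean(axis=0).round(3))
print("Stds: ", X_scaled.std(axis=0).round(3))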

Step 3: How Many Clusters? Finding the Optimal K

This is a critical question in clustering. We’ll use two methods to find the best number of clusters (k):

  1. The Elbow Method: We calculate the inertia (sum of squared distances to the nearest cluster center) for a range of k values. We look for the “elbow” point where the inertia stops decreasing rapidly.
  2. Silhouette Score: This measures how well each point fits its own cluster compared with the nearest neighboring cluster. Scores range from -1 to 1, and values closer to 1 indicate better-separated clusters.

The code in the notebook generates the following plots, which help us decide on the best k.
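
The notebook isn't reproduced in full here, but a minimal sketch of that analysis might look like this (the exact range of k values tested is an assumption):

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt

k_values = range(2, 11)
inertias, sil_scores = [], []

for k in k_values:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = km.fit_predict(X_scaled)
    inertias.append(km.inertia_)  # sum of squared distances to the nearest center
    sil_scores.append(silhouette_score(X_scaled, labels))

# Plot inertia (Elbow Method) and silhouette score side by side
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(list(k_values), inertias, "o-")
ax1.set(title="Elbow Method", xlabel="Number of clusters (k)", ylabel="Inertia")
ax2.plot(list(k_values), sil_scores, "o-")
ax2.set(title="Silhouette Analysis", xlabel="Number of clusters (k)", ylabel="Silhouette score")
plt.tight_layout()
plt.show()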

Optimal K Analysis

Based on the silhouette score, k=2 is technically optimal for this generated dataset. However, to demonstrate segmentation against our four known segments, the notebook proceeds with k=4 (our ground truth) for some of the models. In a real-world scenario, where no ground truth exists, this analysis is crucial.

Step 4: Applying the Clustering Algorithms

With our data ready, we can now apply our clustering algorithms. Here’s a look at how we apply K-Means. The full notebook contains the code for all four algorithms.

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# We'll use the optimal k found from our analysis
optimal_k = 4 

kmeans = KMeans(n_clusters=optimal_k, random_state=42, n_init=10)
kmeans_labels = kmeans.fit_predict(X_scaled)

silhouette_avg = silhouette_score(X_scaled, kmeans_labels)
print(f"K-Means Silhouette Score: {silhouette_avg:.3f}")

We do the same for Hierarchical Clustering, DBSCAN, and GMMs, each of which has its own strengths. For example, DBSCAN is great at finding non-spherical clusters and identifying outliers, while GMMs can handle clusters that overlap.
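
The full code lives in the notebook, but a minimal sketch of how the other three algorithms could be applied with scikit-learn might look like this (the DBSCAN hyperparameters shown are illustrative assumptions, not the notebook's exact values):

from sklearn.cluster import AgglomerativeClustering, DBSCAN
from sklearn.mixture import GaussianMixture

# Hierarchical (agglomerative) clustering with the same number of clusters
hier = AgglomerativeClustering(n_clusters=optimal_k)
hier_labels = hier.fit_predict(X_scaled)

# DBSCAN: density-based, finds its own number of clusters and labels outliers as -1
dbscan = DBSCAN(eps=0.5, min_samples=5)  # eps and min_samples are illustrative values
dbscan_labels = dbscan.fit_predict(X_scaled)

# Gaussian Mixture Model: soft, probabilistic assignments for overlapping clusters
gmm = GaussianMixture(n_components=optimal_k, random_state=42)
gmm_labels = gmm.fit_predict(X_scaled)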

Step 5: Visualizing the Results

A picture is worth a thousand words, especially in clustering. We can visualize the results of each algorithm to see how they segmented the customers.

Cluster Comparison

This visualization gives us an immediate sense of which algorithms performed best. We can see how well the clusters match the “True Clusters” and compare their silhouette scores in the table.
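
The figure itself is produced in the notebook, but a minimal sketch of a side-by-side comparison plot (assuming the label arrays from the previous steps are available) could look like this:

import matplotlib.pyplot as plt

results = {
    "True Clusters": y_true,
    "K-Means": kmeans_labels,
    "Hierarchical": hier_labels,
    "DBSCAN": dbscan_labels,
    "GMM": gmm_labels,
}

# One scatter plot per result, colored by cluster label, on the original (unscaled) axes
fig, axes = plt.subplots(1, len(results), figsize=(20, 4), sharex=True, sharey=True)
for ax, (name, labels) in zip(axes, results.items()):
    ax.scatter(X_raw[:, 0], X_raw[:, 1], c=labels, cmap="viridis", s=15)
    ax.set(title=name, xlabel="Age", ylabel="Income")
plt.tight_layout()
plt.show()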

Conclusion

We’ve successfully gone through a complete clustering pipeline! We generated data, found the optimal number of clusters, applied four different algorithms, and visualized the results.

This project shows that there’s no single “best” clustering algorithm. The right choice depends on your data and your goals. K-Means is a great starting point, but exploring other methods like DBSCAN or GMMs can often lead to better, more meaningful segments.

To see all the code and dive deeper into the analysis, check out the full project on GitHub and Hugging Face.

Happy clustering!

