Simplify Customer Segmentation with PandasAI


Understanding customer needs is critical for business success. Segmenting customers into groups with common traits lets you target products, marketing, and services to each group. This guide walks through customer segmentation using PandasAI, a Python library that makes the process easy and accessible.

Overview

We'll work through segmenting sample customer data step-by-step using PandasAI's conversational interface. Specifically, we'll cover:

Loading and exploring customer data
Selecting segmentation features
Determining the optimal number of clusters
Performing clustering with PandasAI
Analyzing and describing the segments

Follow along with the explanations and code examples below to gain hands-on experience with customer segmentation in Python.

Introduction

Customer segmentation provides immense value by enabling tailored messaging and product offerings. But for many, the technical complexity makes segmentation impractical. PandasAI removes this barrier by automating the process via simple conversational queries.

In this guide, we'll explore customer segmentation hands-on by working through examples using PandasAI. You'll learn how to load data, determine clusters, perform segmentation, and analyze results. The steps are accessible even without extensive data science expertise. By the end, you'll be equipped to leverage PandasAI's capabilities to unlock deeper customer insights. Let's get started!

Step 1 - Load Customer Data

We'll use a fictional customer dataset customers.csv containing 5000 rows with attributes like demographics, location, transactions, etc. Let's load it with Pandas:

import pandas as pd

customers = pd.read_csv("customers.csv")

Preview the data:

customers.head()


This gives us a sense of available attributes for segmentation.
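
Beyond head(), it can help to check column types and missing values before choosing features. The sketch below uses a tiny made-up DataFrame in place of customers.csv; the column names (id, age, gender, city, transactions) are assumptions based on the features used later in this guide, and the values are purely illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for customers.csv (values are made up for illustration)
sample = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "age": [34, 51, np.nan, 27, 45],
    "gender": ["F", "M", "F", None, "M"],
    "city": ["Rome", "Milan", "Rome", "Turin", "Milan"],
    "transactions": [12, 3, 7, 25, 1],
})

# Column types show which features are numeric and which are categorical
column_types = sample.dtypes

# Count missing values per column to see what may need imputing
missing_counts = sample.isna().sum()

# Summary statistics; by default describe() covers numeric columns only
numeric_summary = sample.describe()

print(column_types)
print(missing_counts)
print(numeric_summary)
```

Running the same three calls on the real customers DataFrame shows which columns need imputing or encoding before clustering.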

Step 2 - Select Segmentation Features

Now we need to decide which features to use for creating customer groups. For this example, let's select:

Age
Gender
City
Number of Transactions

Extract these into a new DataFrame:

segmentation_features = ['age', 'gender', 'city', 'transactions']
customer_data = customers[segmentation_features]

Step 3 - Determine Optimal Number of Clusters

A key step is choosing the appropriate number of segments, k. Too few segments blur the distinctions between customer groups; too many produce groups too small to act on.

Traditionally, without PandasAI, we would apply the elbow method to identify the optimal k value for the data, with code along these lines:

from sklearn.cluster import KMeans
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
import pandas as pd
import matplotlib.pyplot as plt

# Handle missing values in the selected features by imputing with the most frequent value in each column
imputer = SimpleImputer(strategy='most_frequent')
df_imputed = pd.DataFrame(imputer.fit_transform(customer_data), columns=customer_data.columns)

# Perform one-hot encoding for the 'gender' and 'city' columns
# (scikit-learn 1.2+ uses sparse_output; older releases use sparse)
encoder = OneHotEncoder(sparse_output=False)
gender_city_encoded = encoder.fit_transform(df_imputed[['gender', 'city']])

# Concatenate the encoded columns with the original DataFrame
df_encoded = pd.concat([df_imputed, pd.DataFrame(gender_city_encoded, columns=encoder.get_feature_names_out(['gender', 'city']))], axis=1)

# Drop the original 'gender' and 'city' columns as they're no longer needed after encoding
df_encoded.drop(columns=['gender', 'city'], inplace=True)

# Calculate SSE for k = 1 to 9
sse = {}
for k in range(1, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    km.fit(df_encoded)
    sse[k] = km.inertia_

# Plot elbow curve
plt.plot(list(sse.keys()), list(sse.values()))
plt.xlabel("Number of Clusters")
plt.ylabel("SSE")
plt.show()


Examining the elbow point, 4 seems a good number of clusters for this data, so we’ll create 4 clusters.

Too complicated? You can easily let PandasAI do it for you.

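from pandasai import SmartDataframe

# .chat() is a SmartDataframe method, so wrap the pandas DataFrame first
sdf = SmartDataframe(customers)
sdf.chat("What is the ideal number of clusters for the given dataset?")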
# 4

PandasAI uses a silhouette score under the hood to calculate the optimal number of clusters for your data.

Silhouette score is a metric used to evaluate the goodness of a clustering model. It measures how well each data point fits within its assigned cluster versus how well it would fit within other clusters.

PandasAI leverages silhouette analysis to pick the optimal number of clusters for k-means segmentation based on which configuration offers the best coherence within clusters and distinction between clusters for the given data.
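
To make the idea concrete: for a single point, the silhouette value is s = (b - a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the mean distance to the points in the nearest other cluster. The sketch below runs the selection manually with scikit-learn on synthetic data. It is an illustration only: PandasAI generates its code through a language model, so what it actually runs may differ, and the blob data here is an assumption standing in for encoded customer features.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with 4 well-separated groups (stand-in for encoded customer features)
X, _ = make_blobs(
    n_samples=400,
    centers=[[0, 0], [10, 0], [0, 10], [10, 10]],
    cluster_std=1.0,
    random_state=42,
)

# Mean silhouette score for each candidate k (the score needs at least 2 clusters)
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The k with the highest mean silhouette score is the preferred choice
best_k = max(scores, key=scores.get)
print(scores)
print("Best k by silhouette score:", best_k)
```

Unlike the elbow method, which relies on reading a plot, this approach yields a single number per k that can be compared directly.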

Step 4 - Perform Clustering with PandasAI

Now we'll use PandasAI to handle clustering based on the elbow method insights.

First, import and initialize a SmartDataframe, if you have not already done so:

from pandasai import SmartDataframe

sdf = SmartDataframe(customers)

Then simply ask PandasAI to cluster:

segments = sdf.chat("""
    Segment customers into 4 clusters based on their age, gender, city and number of transactions.
""")

This performs k-means clustering and adds the segment labels to the original data. Let's inspect the results:

print(segments)


Each customer now has a cluster segment assigned.
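
For comparison, a manual equivalent of this query might look like the sketch below. It is not the code PandasAI generates (that is produced by a language model and can vary); it is one reasonable way to do the same thing with scikit-learn. The sample values are made up, and scaling the numeric columns is a design choice added here so that age and transactions contribute on comparable scales.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical stand-in for the customer data (values are made up)
sample = pd.DataFrame({
    "age": [22, 25, 24, 61, 65, 58, 40, 42],
    "gender": ["F", "M", "F", "M", "F", "M", "F", "M"],
    "city": ["Rome", "Rome", "Milan", "Turin", "Turin", "Milan", "Rome", "Milan"],
    "transactions": [30, 28, 35, 2, 4, 3, 15, 14],
})

# Scale numeric columns and one-hot encode categorical ones
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age", "transactions"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["gender", "city"]),
])

# n_clusters=3 is chosen for this tiny sample; the guide uses 4 on the full dataset
model = make_pipeline(preprocess, KMeans(n_clusters=3, n_init=10, random_state=42))

# fit_predict returns one cluster label per row; store it alongside the data
clustered = sample.assign(cluster=model.fit_predict(sample))
print(clustered)
```

Keeping preprocessing and clustering in one pipeline means new customers can later be assigned to a segment with model.predict without repeating the encoding steps by hand.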

Step 5 - Analyze and Describe Clusters

With the clusters created, we can derive insights by analyzing them:

centers = segments.chat("Show cluster centers")

print(centers)

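
Cluster centers are one way to describe segments. Another is a per-cluster profile built with a plain pandas groupby, which also handles categorical columns. The sketch below assumes the clustered result is available as a regular DataFrame with a cluster column; the data and the output column names (customers, mean_age, and so on) are hypothetical.

```python
import pandas as pd

# Hypothetical clustered output: customer features plus a 'cluster' label
clustered = pd.DataFrame({
    "age": [22, 25, 61, 65, 40, 42],
    "gender": ["F", "M", "M", "F", "F", "M"],
    "city": ["Rome", "Rome", "Turin", "Turin", "Milan", "Milan"],
    "transactions": [30, 28, 2, 4, 15, 14],
    "cluster": [0, 0, 1, 1, 2, 2],
})

# One row per cluster: size, numeric averages, and the most common city
profile = clustered.groupby("cluster").agg(
    customers=("age", "size"),
    mean_age=("age", "mean"),
    mean_transactions=("transactions", "mean"),
    top_city=("city", lambda s: s.mode().iloc[0]),
)
print(profile)
```

A table like this makes it easier to give each segment a descriptive name, such as "young, frequent buyers".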

Step 6 - Enrich Analysis with Additional Data

Our original dataset contained only a few features. To enhance the analysis, we can join the clustered data with additional customer info like:

Purchase history
Customer lifetime value
Engagement metrics
Product usage
Ratings/reviews

Bringing in other datasets allows drilling down into each segment with a deeper perspective.

For example, we could join the review history and analyze customer satisfaction patterns within each cluster:

# `reviews` is assumed to be a DataFrame with 'id' and 'review_score' columns
# Join review data to the segmented dataset
enriched_data = pd.merge(segments, reviews, on='id')

# Review score statistics for each cluster
enriched_data.groupby('cluster').review_score.agg(['mean', 'median', 'count'])

This provides a multidimensional view of our customers and segments, and supports deeper analysis through additional aggregate metrics for each cluster.
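
One detail worth checking when joining: pd.merge defaults to an inner join, which silently drops customers who have no reviews. The sketch below uses a left join instead and also reports how many customers in each cluster left a review. Both DataFrames and their values are hypothetical, with the id, cluster, and review_score column names taken from the example above.

```python
import pandas as pd

# Hypothetical segmented customers and review records (values are made up)
segments_df = pd.DataFrame({"id": [1, 2, 3, 4], "cluster": [0, 0, 1, 1]})
reviews_df = pd.DataFrame({"id": [1, 1, 2, 3], "review_score": [5, 3, 4, 2]})

# how="left" keeps customers without reviews; validate checks ids are unique on the left
enriched = pd.merge(segments_df, reviews_df, on="id", how="left", validate="one_to_many")

# Review statistics per cluster; 'count' skips the missing scores
review_stats = enriched.groupby("cluster")["review_score"].agg(["mean", "median", "count"])

# Share of customers in each cluster with at least one review
coverage = (
    enriched.groupby(["cluster", "id"])["review_score"].count().gt(0)
    .groupby(level="cluster").mean()
)

print(review_stats)
print(coverage)
```

A low coverage value is a signal that the review averages for that cluster rest on only a few customers.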

Conclusion

In this guide, we worked through segmenting sample customer data step-by-step using PandasAI. The key aspects covered were:

Loading customer data and selecting relevant features
Using the elbow method, or a single PandasAI query, to determine the optimal number of clusters
Performing k-means clustering via simple PandasAI queries
Analyzing and describing the created segments

Segmentation provides immense value through tailored products and messaging. PandasAI makes the process accessible even without extensive data science expertise. By automating complex tasks through conversation, PandasAI allows you to gain actionable insights from your customer data.

To build on this, additional data like customer lifetime value or engagement metrics could provide even deeper understanding of your customers. The key is asking the right questions – PandasAI handles the hard work to uncover meaningful answers from your data.

Now you're equipped with hands-on experience leveraging PandasAI to simplify customer segmentation in Python.

Author Bio

Gabriele Venturi is a software engineer and entrepreneur who started coding at the young age of 12. Since then, he has launched several projects across gaming, travel, finance, and other spaces - contributing his technical skills to various startups across Europe over the past decade.

Gabriele's true passion lies in leveraging AI advancements to simplify data analysis. This mission led him to create PandasAI, released open source in April 2023. PandasAI integrates large language models into the popular Python data analysis library Pandas. This enables an intuitive conversational interface for exploring data through natural language queries.

By open-sourcing PandasAI, Gabriele aims to share the power of AI with the community and push boundaries in conversational data analytics. He actively contributes as an open-source developer dedicated to advancing what's possible with generative AI.