
Identifying Key Customer Segments [Winner]

Tools used in this project

Power BI - use 80% zoom for optimal full-width viewing

About this project

Goal

The task was to act as a Senior Marketing Analyst at Maven Cafe, which had just run a test by sending different combinations of promotional offers to existing rewards members. Now that the 30-day test period has concluded, the task is to identify key customer segments and develop a data-driven strategy for future promotional messaging & targeting. The results should be summarized in a report that will be presented to the CMO.

The data consists of a table of 17,000 unique existing rewards members, including their age, income, gender, and date of membership, and a second table capturing all events during the 30-day period, such as when offers were sent, viewed, and completed, along with all customer transactions with the cafe.

Given the assignment, I decided to create a one-page explanatory dashboard presenting the key findings in a way that makes them easy for the CMO to digest and act upon.

Data Cleaning

The data cleaning process was relatively straightforward and therefore not too time-consuming. The main challenge was deciding how to handle customers with missing values, particularly those with an age recorded as 118. I chose to remove these entries from the dataset entirely.
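
A minimal sketch of that cleaning step in pandas, assuming the customer table is loaded into a DataFrame named customers with age, gender, and income columns (file and column names are assumptions, not the original script):

import pandas as pd

# Hypothetical file and column names, for illustration only
customers = pd.read_csv("customers.csv")

# Drop members whose age is recorded as the 118 placeholder,
# then drop any remaining rows with missing demographic values
customers = customers[customers["age"] != 118]
customers = customers.dropna(subset=["gender", "income"])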

Data Analysis

First, I created bins for age and income to group customers together. After some research, I defined the following age groups: 18-24 (Young Adults), 25-34 (Early Career Professionals), 35-44 (Young Families), 45-54 (Mature Professionals), 55-64 (Pre-Retirement), and 65+ (Retirees). For income groups, I categorized customers into four brackets: 30-50k (Lower-Middle Income), 50-75k (Middle Income), 75-100k (Upper-Middle Income), and 100-120k (Affluent). These groupings provided a solid overview of customer demographics and revenue distribution.
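
A sketch of how such bins can be created with pandas, assuming the cleaned customer data sits in a DataFrame df with numeric age and income columns (an assumption, not the original script):

import pandas as pd

# Age groups: 18-24, 25-34, ..., 65+
age_bins = [18, 25, 35, 45, 55, 65, 120]
age_labels = ["18-24 Young Adults", "25-34 Early Career Professionals",
              "35-44 Young Families", "45-54 Mature Professionals",
              "55-64 Pre-Retirement", "65+ Retirees"]
df["Age Group"] = pd.cut(df["age"], bins=age_bins, labels=age_labels, right=False)

# Income groups: 30-50k, 50-75k, 75-100k, 100-120k
income_bins = [30_000, 50_000, 75_000, 100_000, 120_001]
income_labels = ["Lower-Middle Income", "Middle Income",
                 "Upper-Middle Income", "Affluent"]
df["Income Group"] = pd.cut(df["income"], bins=income_bins,
                            labels=income_labels, right=False)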

Key Customer Segments

To identify key customer segments, I employed a K-Means clustering algorithm in Python, using the variables age, income, amount purchased, and number of transactions. I standardized the data using StandardScaler and determined the optimal number of clusters by applying the elbow method and evaluating the silhouette score, as demonstrated in the following code:

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
from sklearn.decomposition import PCA

features = ['age', 'income', 'Amount Purchased', 'No. of Transactions']
X = df[features]

# Standardizing the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Finding the optimal number of clusters
inertia = []
silhouette_scores = []
k_range = range(2, 11)

for k in k_range:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X_scaled)
    inertia.append(kmeans.inertia_)
    silhouette_scores.append(silhouette_score(X_scaled, kmeans.labels_))
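
The two graphs referenced below are not part of the snippet above; a minimal plotting sketch using the matplotlib import already in place could look like this:

# Elbow curve (inertia) and silhouette scores side by side
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].plot(list(k_range), inertia, marker="o")
axes[0].set(title="Elbow Method", xlabel="Number of clusters k", ylabel="Inertia")
axes[1].plot(list(k_range), silhouette_scores, marker="o")
axes[1].set(title="Silhouette Scores", xlabel="Number of clusters k", ylabel="Silhouette score")
plt.tight_layout()
plt.show()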

Looking at both graphs, I chose k=4 as the optimal number of clusters because it provides the best balance between cluster tightness (inertia) and clarity of cluster separation (silhouette score).

In the next and final step, I applied dimensionality reduction for visualization using PCA:

# Choosing the optimal_k based on the Elbow Method or Silhouette Score
optimal_k = 4

# K-Means Clustering with the optimal number of clusters
kmeans = KMeans(n_clusters=optimal_k, random_state=42)
df['Cluster'] = kmeans.fit_predict(X_scaled)

# Dimensionality Reduction for Visualization using PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Adding the PCA results to the df for plotting
df['PCA1'] = X_pca[:, 0]
df['PCA2'] = X_pca[:, 1]

This allowed me to visualize the four customer segments in a scatter plot.
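
A minimal sketch of that scatter plot, reusing the seaborn import from above:

# Plot the two principal components, colored by cluster assignment
plt.figure(figsize=(8, 6))
sns.scatterplot(data=df, x="PCA1", y="PCA2", hue="Cluster", palette="tab10", s=20)
plt.title("Customer segments (K-Means, k=4) in PCA space")
plt.show()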

Offer- and Channel-Analysis

Offers were sent randomly to customers in six send-out events during the 30-day period, covering ten different offers. On average, each offer was sent around 1,100 times per event, totaling between 6,500 and 6,700 sends per offer. The first two events were spaced seven days apart, while the remaining four were sent every three to four days. The distribution of offers across age and income groups was relatively even. The majority of customers received 4-5 offers, suggesting a potential dependency between receiving multiple offers and the sequence in which they were sent. However, I assumed each offer was independent and did not explore this potential dependency further. This assumption implies that each offer was sent to a distinct subset of customers, effectively treating each group as a sample. To compare offers accurately, it is therefore essential to use statistical methods to validate the findings.

My objective was to identify which of the eight promotional offers (excluding the two informational offers) performed best. I chose to analyze two metrics: the redemption rate (or completion rate) and the revenue associated with completed offers.

The redemption rate showed a clear trend for two offers, and due to the large sample size, the 95% confidence intervals for proportions were narrow. This suggests that the redemption rates are consistent across the sample, which strengthens the reliability of the results.
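
For reference, a 95% confidence interval for a redemption rate can be computed like this (a sketch with placeholder counts, not the actual figures from the analysis; proportion_confint comes from statsmodels):

from statsmodels.stats.proportion import proportion_confint

completed = 4_000   # offers completed (placeholder value)
sent = 6_600        # offers sent (placeholder value)

low, high = proportion_confint(completed, sent, alpha=0.05, method="wilson")
print(f"Redemption rate: {completed / sent:.1%} (95% CI: {low:.1%} - {high:.1%})")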

The revenue associated with completed offers showed relatively small variation, with the highest at $97k and the lowest at $77k. The 95% confidence intervals for the average revenue associated with completed offers indicated that the three offers with the highest average revenue had confidence intervals that did not overlap with those of the other offers (please see the dashboard for the visualization). This suggests a statistically significant difference in the revenue generated per transaction. However, the confidence intervals of these three offers overlap with one another. Although performing t-tests to determine the best offer in terms of average revenue would be the next logical step, I excluded this from the current analysis.
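
A corresponding 95% confidence interval for the average revenue of a single offer could be sketched with scipy as follows, assuming a DataFrame offers_df with one row per completed offer and hypothetical offer_id and revenue columns:

from scipy import stats

# Revenue amounts tied to completed instances of one offer (hypothetical names)
revenues = offers_df.loc[offers_df["offer_id"] == "offer_A", "revenue"]

mean = revenues.mean()
ci_low, ci_high = stats.t.interval(0.95, len(revenues) - 1,
                                   loc=mean, scale=stats.sem(revenues))
print(f"Average revenue: {mean:.2f} (95% CI: {ci_low:.2f} - {ci_high:.2f})")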

Next, I analyzed the effectiveness of the different channels. For instance, it might be expected that customers aged 18-35 would be more responsive to offers sent via mobile and social channels. The analysis showed that offers sent through mobile and social platforms had a much higher view rate than those sent via web+mail or web+mail+mobile. Due to the large sample size, the confidence intervals were again narrow. However, as seen in the following picture, and contrary to expectations, the view rate was slightly higher among older rewards customers. For the non-web channels, on the other hand, the view rate among older rewards customers was lower than among younger ones.
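
A sketch of how such a view-rate comparison can be computed, assuming a DataFrame offer_log with one row per offer sent, a boolean viewed flag, the delivery channel combination, and the recipient's age group (all names are assumptions):

# Share of sent offers that were viewed, per channel combination and age group
view_rate = (offer_log
             .groupby(["channel", "Age Group"], observed=True)["viewed"]
             .mean()
             .unstack("channel"))
print(view_rate.round(3))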

Finally, I examined the effects of the 30-day time period on both metrics (redemption rate and revenue). The results indicated that sending offers more frequently increased average daily revenue but did not result in a higher overall redemption rate.

I also considered analyzing the time between receiving an offer and viewing it, as well as the time between viewing and completing it. However, I found that it was possible to complete an offer without viewing it, and there were instances where the same offer was sent twice to a customer but only viewed once. These cases would skew the results. Additionally, there were instances where customers completed nearly every offer before viewing it, as seen in the following picture. Due to these inconsistencies, I chose not to pursue this analysis further.

Final Thoughts: The topic was very interesting and allowed me to refresh my knowledge of statistical methods. @Maven Analytics: I would greatly appreciate a guided project focused on statistics, perhaps based on this dataset. If one is already available, please let me know. I believe many data analysts in your community could benefit greatly from it.

Design

I used a color palette tool to generate the color scheme, selecting various shades of brown to reflect the color of coffee. To enhance visual comprehensibility, I applied Gestalt principles of visual perception, particularly Proximity and Similarity. Additionally, each visual begins with its main message presented in plain text for clarity.

Future Promotional Messaging & Targeting

Based on the findings, I would suggest the following:

  • Cluster 1 - Frequent Low-Spenders (30% of Revenue): This group should be targeted with lower-difficulty offers (due to their lower income) and shorter durations. Despite their low average transaction value, their frequent visits to the store suggest that such offers could further increase their visit frequency.
  • Cluster 2 - Weekly Low-Spenders: These customers would benefit from lower-difficulty offers (because of their lower income) but longer durations, given their limited contact with the cafe.
  • Cluster 3 - Weekly Mid-Spenders (44% of Revenue): This group would respond well to offers with higher difficulty and longer durations. These offers can help maintain their higher spending levels and encourage continued engagement. Make sure to use the web channel, as this group skews older.
  • Cluster 4 - Frequent High-Spenders: To maximize revenue and ensure customer retention for this group, provide offers with high difficulty and medium durations. This approach aims to both challenge them and maintain their high level of spending. Again, make sure to use the web channel, as this group also skews older.

Additional project images
