Goal
The task was to act as a Senior Marketing Analyst at Maven Cafe, who has just run a test by sending different combinations of promotional offers to existing rewards members. Now that the 30-day test period has concluded, the task is to identify key customer segments and develop a data-driven strategy for future promotional messaging & targeting. The results should be summarized in a report that will be presented to the CMO.
The data consists of a table with 17,000 unique existing rewards members, including their age, income, gender, and date of membership; and another table capturing all events during the 30-day period, such as when offers were sent, viewed, and completed, along with all customer transactions with the cafe.
Given the assignment, I decided to create a one-page explanatory dashboard with the key findings, presented in a way that makes it easy for the CMO to digest and act upon.
Data Cleaning
The data cleaning process was relatively straightforward and therefore not too time consuming. The main challenge was deciding how to handle customers with missing values, particularly those with an age recorded as 118. I chose to remove these entries from the dataset entirely.
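The removal step can be sketched as follows. This is a minimal example with a hypothetical subset of the rewards-member table; the column names and values are assumptions, not the actual data.

```python
import pandas as pd

# Hypothetical slice of the rewards-member table (real data: 17,000 rows)
customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "age": [25, 118, 42, 118],
    "income": [45000.0, None, 72000.0, None],
})

# Drop members whose age is recorded as the placeholder value 118
clean = customers[customers["age"] != 118].reset_index(drop=True)
```

Dropping the rows (rather than imputing) keeps the demographic segments based only on verified member attributes.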
Data Analysis
First, I created bins for age and income to group customers together. After some research, I defined the following age groups: 18-24 (Young Adults), 25-34 (Early Career Professionals), 35-44 (Young Families), 45-54 (Mature Professionals), 55-64 (Pre-Retirement), 65+ (Retirees). For income groups, I categorized customers into four brackets: 30-50k (Lower-Middle Income), 50-75k (Middle Income), 75k-100k (Upper-Middle Income) and 100-120k (Affluent). These groupings provided a solid overview of customer demographics and revenue distribution.
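The binning described above maps directly onto `pd.cut`. The sketch below uses the age and income boundaries from the text with a few hypothetical customers; the column names are assumptions.

```python
import pandas as pd

# Hypothetical customers; the real table has 17,000 rows
df = pd.DataFrame({
    "age": [22, 30, 40, 50, 60, 70],
    "income": [35000, 55000, 80000, 110000, 62000, 48000],
})

# Age groups: 18-24, 25-34, 35-44, 45-54, 55-64, 65+
age_bins = [18, 25, 35, 45, 55, 65, 120]
age_labels = ["Young Adults", "Early Career Professionals", "Young Families",
              "Mature Professionals", "Pre-Retirement", "Retirees"]
df["age_group"] = pd.cut(df["age"], bins=age_bins, labels=age_labels, right=False)

# Income brackets: 30-50k, 50-75k, 75-100k, 100-120k
income_bins = [30000, 50000, 75000, 100000, 120001]
income_labels = ["Lower-Middle Income", "Middle Income",
                 "Upper-Middle Income", "Affluent"]
df["income_group"] = pd.cut(df["income"], bins=income_bins,
                            labels=income_labels, right=False)
```

With `right=False`, each bin includes its lower edge and excludes the upper one, so a 25-year-old lands in "Early Career Professionals" rather than "Young Adults".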
Key Customer Segments
To identify key customer segments, I employed a K-Means clustering algorithm in Python, using the variables age, income, amount purchased, and number of transactions. I standardized the data using StandardScaler and determined the optimal number of clusters by applying the elbow method and evaluating the silhouette score, as demonstrated in the following code:
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
from sklearn.decomposition import PCA

features = ['age', 'income', 'Amount Purchased', 'No. of Transactions']
X = df[features]

# Standardizing the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Finding the optimal number of clusters
inertia = []
silhouette_scores = []
k_range = range(2, 11)

for k in k_range:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X_scaled)
    inertia.append(kmeans.inertia_)
    silhouette_scores.append(silhouette_score(X_scaled, kmeans.labels_))
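The two diagnostics can then be plotted side by side. The following is a self-contained sketch that substitutes synthetic data for the scaled feature matrix, so it runs on its own; with the real `X_scaled` the loop is identical.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen (no display needed)
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the scaled feature matrix (four separated blobs)
rng = np.random.default_rng(42)
X_scaled = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 4))
                      for c in (0, 3, 6, 9)])

k_range = range(2, 11)
inertia, silhouette_scores = [], []
for k in k_range:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_scaled)
    inertia.append(km.inertia_)
    silhouette_scores.append(silhouette_score(X_scaled, km.labels_))

# Elbow plot (inertia) next to the silhouette-score curve
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(list(k_range), inertia, marker="o")
ax1.set(title="Elbow Method", xlabel="k", ylabel="Inertia")
ax2.plot(list(k_range), silhouette_scores, marker="o")
ax2.set(title="Silhouette Score", xlabel="k", ylabel="Score")
fig.savefig("cluster_diagnostics.png")
```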
Looking at both graphs, I chose k=4 as the optimal number of clusters because it provides the best balance between cluster tightness (inertia) and the clarity of cluster separation (silhouette score).
In the next and final step, I applied dimensionality reduction for visualization using PCA:
# Choosing the optimal_k based on the Elbow Method or Silhouette Score
optimal_k = 4
# K-Means Clustering with the optimal number of clusters
kmeans = KMeans(n_clusters=optimal_k, random_state=42)
df['Cluster'] = kmeans.fit_predict(X_scaled)
# Dimensionality Reduction for Visualization using PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
# Adding the PCA results to the df for plotting
df['PCA1'] = X_pca[:, 0]
df['PCA2'] = X_pca[:, 1]
This allowed me to visualize the four customer segments in a scatter plot.
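That scatter plot can be sketched with seaborn. The frame below uses hypothetical PCA coordinates and cluster labels in place of the real ones; with the actual `df` from the clustering step, only the plotting lines are needed.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for the frame holding PCA coordinates and labels
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "PCA1": rng.normal(size=200),
    "PCA2": rng.normal(size=200),
    "Cluster": rng.integers(0, 4, size=200),
})

# One color per cluster on the two principal components
ax = sns.scatterplot(data=df, x="PCA1", y="PCA2", hue="Cluster", palette="deep")
ax.set_title("Customer Segments (PCA projection)")
plt.savefig("segments.png")
```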
Offer- and Channel-Analysis
Each of the ten offers was sent randomly to customers across six send events during the 30-day period. On average, each offer was sent around 1,100 times per event, totaling between 6,500 and 6,700 sends. The first two events were spaced 7 days apart, while the remaining four were sent every 3-4 days. The distribution of offers across age and income groups was relatively even. The majority of customers received 4-5 offers, suggesting a potential dependency between receiving multiple offers and the sequence in which they were sent. However, I assumed each offer was independent and did not explore this potential dependency further. This assumption implies that each offer was sent to a distinct subset of customers, effectively treating each group as a sample. To compare offers accurately, it is therefore essential to use statistical methods to validate the findings.
My objective was to identify which of the eight promotional offers (excluding the two informational offers) performed best. I chose to analyze two metrics: the redemption rate (or completion rate) and the revenue associated with completed offers.
The redemption rate showed a clear lead for two offers, and due to the large sample size, the 95% confidence intervals for proportions were narrow. This suggests that the redemption rates are consistent across the sample, which strengthens the reliability of the results.
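A 95% confidence interval for a redemption rate can be computed with the standard normal approximation for a proportion. The counts below are illustrative, not the actual offer figures.

```python
import math

def proportion_ci(successes: int, n: int, z: float = 1.96):
    """95% normal-approximation confidence interval for a proportion."""
    p = successes / n
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half

# Illustrative numbers: ~6,600 sends of one offer, 4,200 completions
low, high = proportion_ci(successes=4200, n=6600)
```

With samples in the thousands, the half-width shrinks with the square root of n, which is why the intervals in the analysis came out narrow.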
The revenue associated with completed offers showed a small variation, with the highest at $97k and the lowest at $77k. The 95% confidence intervals for the average revenue associated with completed offers indicated that the three offers with the highest average revenue had non-overlapping confidence intervals with the others (please see the dashboard for the visualization). This suggests a statistically significant difference in the revenue generated per transaction. However, these three offers have overlapping confidence intervals among themselves. Although performing t-tests to determine the best offer in terms of average revenue would be the next logical step, I excluded this from the current analysis.
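The confidence interval for an average revenue figure uses a t-interval rather than the proportion formula. A minimal sketch, using a synthetic revenue-per-completion sample in place of the real transaction amounts:

```python
import numpy as np
from scipy import stats

# Synthetic revenue per completed offer; the real figures differ
rng = np.random.default_rng(1)
revenue = rng.normal(loc=14.5, scale=4.0, size=800)

# 95% t-interval around the sample mean
mean = revenue.mean()
sem = stats.sem(revenue)
low, high = stats.t.interval(0.95, df=len(revenue) - 1, loc=mean, scale=sem)
```

Comparing such intervals across offers is what shows whether the top three overlap with the rest; a two-sample t-test would be the follow-up for separating the top three from each other.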
Next, I analyzed the effectiveness of different channels. For instance, it might be expected that customers aged 18-35 would be more responsive to offers sent via mobile and social channels. The analysis showed that offers sent through mobile and social platforms had a much higher view rate than those sent via web+mail or web+mail+mobile. Due to the large sample size, the confidence intervals were again narrow. However, as seen in the following picture, contrary to expectations, the view rate was slightly higher among older rewards customers. For the non-web channels, on the other hand, the view rate among older rewards customers was lower than among younger ones.
Finally, I examined the effects of the 30-day time period on both metrics (redemption rate and revenue). The results indicated that sending offers more frequently increased average daily revenue but did not result in a higher overall redemption rate.
I also considered analyzing the time between receiving an offer and viewing it, as well as the time between viewing and completing it. However, I found that it was possible to complete an order without viewing the offer, and there were instances where the same offer was sent twice to a customer but only viewed once. These cases would skew the results. Additionally, there were instances where customers completed nearly every offer before viewing it, as seen in the following picture. Due to these inconsistencies, I chose not to pursue this analysis further.
Final Thoughts: The topic was very interesting and allowed me to refresh my knowledge of statistical methods. @Maven Analytics: I would greatly appreciate a guided project focused on statistics, perhaps based on this dataset. If there is already one available, please let me know. I believe many data analysts in your community could benefit greatly from it.
Design
I used a color palette tool to generate the color scheme, selecting various shades of brown to reflect the color of coffee. To enhance visual comprehensibility, I applied Gestalt principles of visual perception, particularly Proximity and Similarity. Additionally, each visual begins with its main message presented in plain text for clarity.
Future Promotional Messaging & Targeting
Based on the findings, I would suggest the following: