
Customer Segmentation Analysis (Power BI)


Introduction

This dataset originates from Kaggle and represents the Global Superstore data. It has been deliberately chosen to enable a comprehensive examination of customer behavior through the implementation of Recency, Frequency, and Monetary (RFM) analysis. The primary objective of this study is to reveal meaningful patterns using the RFM methodology and to predict potential factors that may influence customer churn in the future.

Data Source: https://www.kaggle.com/datasets/gopikamahadevan/global-superstore

Background

Customer retention is generally more cost-effective than customer acquisition. I also have a keen interest in identifying patterns, and both of these motivated me to delve deeper into this project.

Data Preparation

Data Cleaning

I imported the data into a SQL database from a CSV file using DDL (Data Definition Language) and DML (Data Manipulation Language) statements. During exploratory analysis, I identified an error in the "Product ID" column: each "Product ID" should correspond to exactly one "Product Name".

Two different customers can legitimately share the same Customer Name, but the same logic does not apply to products: each Product Name should map to a unique Product ID.

[screenshot]

When I delved further, I discovered that in most cases different products were assigned to the same Product ID, which is a clear indication that the data needs pre-processing.
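
The duplicate check behind this finding can be sketched in Python as well; the actual work was done in SQL, and the sample rows and IDs below are hypothetical:

```python
from collections import defaultdict

# Hypothetical sample rows: (product_id, product_name)
rows = [
    ("TEC-PH-001", "Samsung Smart Phone"),
    ("TEC-PH-001", "Nokia Smart Phone"),   # same ID, different product
    ("OFF-PA-002", "Xerox Copy Paper"),
]

# Group the product names seen under each Product ID,
# then flag every ID that carries more than one distinct name
names_per_id = defaultdict(set)
for product_id, product_name in rows:
    names_per_id[product_id].add(product_name)

conflicts = {pid: names for pid, names in names_per_id.items() if len(names) > 1}
print(conflicts)  # only TEC-PH-001 is flagged
```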

[screenshot]

So, I generated new keys using one of my favorite SQL window functions, ROW_NUMBER, ensuring each product has a unique identifier. This preprocessing step improves data quality and consistency and facilitates accurate analysis and modeling.
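
The ROW_NUMBER idea translates to a small stand-in; assuming one surrogate key per distinct product name (the names below are made up), a Python sketch looks like this:

```python
# Hypothetical product names from a fact table; the new key mimics
# SQL's ROW_NUMBER() numbering over the distinct, sorted names.
product_names = [
    "Nokia Smart Phone",
    "Samsung Smart Phone",
    "Xerox Copy Paper",
    "Samsung Smart Phone",  # duplicate row in the fact table
]

# One unique identifier per distinct product name
new_ids = {name: i for i, name in enumerate(sorted(set(product_names)), start=1)}
print(new_ids["Nokia Smart Phone"])  # 1
```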

[screenshot]

After pre-processing the data with the new Product IDs, I performed DDL and DML operations again to create a new table called 'fact_sales_2'.

[screenshot]

And this is the result. In the meantime, I renamed the tables after making a secure backup. When data contains errors, I keep a backup copy, but I never touch the raw file; that is my primary rule whenever I start analyzing any data.

[screenshot]

Data Prep for RFM (Recency, Frequency, and Monetary)

I have created a dimension table named "dim_customer" to store the RFM scores derived using the NTILE Window Function in SQL. Initially, a Common Table Expression (CTE) was used to obtain the essential columns for RFM calculations:

  1. Max Order Date for "Recency"
  2. Count of Orders for "Frequency"
  3. The Sum of Profit (After Discount) for "Monetary"
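
The three base metrics above can be sketched in Python; the real computation lives in a SQL CTE, and the customer IDs, dates, and profit figures below are made up:

```python
from datetime import date

# Hypothetical order rows: (customer_id, order_date, profit_after_discount)
orders = [
    ("C1", date(2014, 12, 1), 120.0),
    ("C1", date(2014, 6, 15), 80.0),
    ("C2", date(2013, 3, 10), 40.0),
]

rfm = {}
for customer_id, order_date, profit in orders:
    rec = rfm.setdefault(
        customer_id, {"last_order": order_date, "frequency": 0, "monetary": 0.0}
    )
    rec["last_order"] = max(rec["last_order"], order_date)  # Recency: max order date
    rec["frequency"] += 1                                   # Frequency: count of orders
    rec["monetary"] += profit                               # Monetary: sum of profit after discount

print(rfm["C1"])  # two orders, latest in December 2014
```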

Subsequently, the NTILE window function was applied to rank the scores from 1 to 10. Notably, if I were to analyze this data in Excel, I would definitely use the PERCENTILE.INC function, which shares similarities with the NTILE function in SQL.

Finally, I aggregated the RFM values and applied the NTILE function again to establish an adjusted rank based on the combined RFM metrics.
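
A minimal Python stand-in for NTILE, assuming distinct values sorted ascending (the SQL window function handles ties and partitions, which this sketch does not):

```python
def ntile(values, n):
    """Assign each distinct value a bucket 1..n by ascending sort order,
    mimicking SQL's NTILE(n): bucket sizes differ by at most one row."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n)
    buckets, start = {}, 0
    for bucket in range(1, n + 1):
        end = start + size + (1 if bucket <= extra else 0)
        for v in ordered[start:end]:
            buckets[v] = bucket
        start = end
    return buckets

# Hypothetical monetary totals for five customers, ranked into 5 tiles;
# in the project the same pass is run again over the summed R+F+M scores.
monetary = [10, 40, 25, 90, 60]
print(ntile(monetary, 5))  # each value lands in its own bucket
```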

[screenshot]

Dashboard Development

After establishing the SQL views, I connected them to Power BI for dashboard development. I prioritized creating the data model, which is the first and most fundamental step of the dashboard development process.

Below is a screenshot of my measure table (blue box), which I set up to organize the various measures systematically. I also created separate display folders to categorize the different types of measures. These practices help mitigate future errors and make debugging more straightforward.

[screenshot]

After that, I created a Calendar Table using DAX. While I am aware of the CALENDARAUTO function, I already knew the exact columns and date range from the notes I took during exploratory data analysis, so I chose to build the Calendar Table explicitly.
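
The explicit-range calendar logic (DAX's CALENDAR(start, end) rather than CALENDARAUTO) can be sketched in Python; the 2011 to 2014 range is my assumption, anchored only by the dataset's stated maximum date of December 2014:

```python
from datetime import date, timedelta

# Build a contiguous calendar over an explicit, known date range,
# mirroring DAX CALENDAR(start, end). Range is assumed, not confirmed.
start, end = date(2011, 1, 1), date(2014, 12, 31)
calendar = [start + timedelta(days=i) for i in range((end - start).days + 1)]

print(len(calendar))  # 1461 days (2012 is a leap year)
```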

[screenshot]

Following that, I created a key column for the analysis: a calculated column for the RFM segment using DAX (Data Analysis Expressions). I genuinely appreciate the SWITCH function in DAX because it saves a significant amount of time compared to writing lengthy nested IF statements.
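
A SWITCH(TRUE(), ...) call is essentially an ordered chain of condition-to-result pairs, which a plain Python function can mimic. The segment names follow the Results section, but the score thresholds here are illustrative, not the project's actual cutoffs:

```python
def rfm_segment(score):
    """Map a combined RFM score (1..10) to a segment name.
    Thresholds are hypothetical stand-ins for the real SWITCH logic."""
    if score >= 9:
        return "Top Customer"
    if score >= 7:
        return "Loyal Customer"
    if score >= 4:
        return "At Risk"
    return "High Churn Risk"

print(rfm_segment(8))  # Loyal Customer
```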

[screenshot]

Subsequently, I created a measure to identify active customers. My rule is that if a customer has not been active for one year, I consider them churned (inactive). The maximum date in the dataset is December 2014, so customers who made no purchases in 2014 are categorized as inactive.
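
The churn rule above is simple to sketch; assuming a 365-day cutoff measured from the dataset's maximum date (the actual measure is written in DAX):

```python
from datetime import date

# Active/churned flag: churned if no purchase within one year of the
# dataset's maximum date (December 2014), per the rule stated above.
MAX_DATE = date(2014, 12, 31)

def is_active(last_order_date):
    """True if the customer ordered within 365 days of MAX_DATE."""
    return (MAX_DATE - last_order_date).days <= 365

print(is_active(date(2014, 3, 1)))   # True: purchased in 2014
print(is_active(date(2013, 6, 1)))   # False: churned
```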

[screenshot]

My final step was a systematic process of identifying the columns essential to the analysis and then updating the SQL database's views accordingly. Trimming the data at the view level keeps the Power BI model lean and optimized for analysis and visualization.

[screenshot]

Result

Insight 1: Customers in the "High Churn Risk" and "At Risk" segments are not receiving enough discounts. This may be a contributing factor, since these two segments contain more customers than the "Top Customers" segment and as many as the "Loyal Customers" segment.

Insight 2: Purchase frequency is not correlated with profit margin. Only customers with a purchase frequency of 40 to 70 generate high profits.

[screenshot]
