This dataset originates from Kaggle and represents the Global Superstore data. It has been deliberately chosen to enable a comprehensive examination of customer behavior through the implementation of Recency, Frequency, and Monetary (RFM) analysis. The primary objective of this study is to reveal meaningful patterns using the RFM methodology and to predict potential factors that may influence customer churn in the future.
Data Source: https://www.kaggle.com/datasets/gopikamahadevan/global-superstore
As per my understanding, customer retention is generally more cost-effective than customer acquisition. Moreover, I have a keen interest in identifying patterns, and this serves as a strong motivation for me to delve deeper into this project.
I imported the data into the SQL database from a CSV file using DDL (Data Definition Language) and DML (Data Manipulation Language) statements. After conducting some exploratory analysis, I identified an error in the "Product IDs": in a clean dataset, each "Product ID" should correspond to exactly one "Product Name".
Click here to view my SQL code.
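Since the full script is linked above, here is only a minimal sketch of the kind of DDL and DML involved, assuming SQL Server; the table name, column list, and file path are illustrative placeholders rather than my exact script:

```sql
-- DDL: define a staging table for the CSV (columns are illustrative)
CREATE TABLE fact_sales (
    row_id        INT,
    order_id      VARCHAR(20),
    order_date    DATE,
    customer_id   VARCHAR(20),
    customer_name VARCHAR(100),
    product_id    VARCHAR(20),
    product_name  VARCHAR(200),
    sales         DECIMAL(12, 2),
    quantity      INT,
    discount      DECIMAL(5, 2),
    profit        DECIMAL(12, 2)
);

-- DML: load the CSV file (the path is a placeholder)
BULK INSERT fact_sales
FROM 'C:\data\global_superstore.csv'
WITH (FORMAT = 'CSV', FIRSTROW = 2, FIELDTERMINATOR = ',', ROWTERMINATOR = '\n');
```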
Customer Names can legitimately be shared by different customers, but the same logic does not apply to products: each Product ID should map to a single, unique Product Name.
When I delved further, I discovered that in many cases different products were assigned to the same Product ID. This is a clear indication that data pre-processing was needed.
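A check along these lines exposes the problem (table and column names are assumptions based on the dataset):

```sql
-- Find Product IDs that map to more than one Product Name;
-- a clean key would return zero rows here.
SELECT product_id,
       COUNT(DISTINCT product_name) AS name_count
FROM fact_sales
GROUP BY product_id
HAVING COUNT(DISTINCT product_name) > 1
ORDER BY name_count DESC;
```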
So, I generated new keys using one of my favorite SQL window functions, ROW_NUMBER, ensuring each product has a unique identifier. This pre-processing step is important because it improves data quality and consistency and facilitates accurate analysis and modeling.
After pre-processing the data with the new Product IDs, I performed DDL and DML operations again to create a new table called 'fact_sales_2'.
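In outline, the two steps can be sketched as follows; this is a simplified version, assuming SQL Server, with an illustrative key format:

```sql
-- Mint one surrogate key per distinct Product Name with ROW_NUMBER,
-- then materialize the corrected table with SELECT ... INTO
-- (which combines the DDL and DML steps).
WITH product_keys AS (
    SELECT product_name,
           'PRD-' + CAST(ROW_NUMBER() OVER (ORDER BY product_name) AS VARCHAR(10)) AS new_product_id
    FROM (SELECT DISTINCT product_name FROM fact_sales) AS p
)
SELECT fs.*,
       pk.new_product_id
INTO fact_sales_2
FROM fact_sales AS fs
JOIN product_keys AS pk
  ON pk.product_name = fs.product_name;
```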
And this is the result. Note that I changed the names only after ensuring a secure backup: even when the data contains errors, I keep the original as a backup and never touch the raw file. This is my primary rule whenever I start analyzing any data.
I have created a dimension table named "dim_customer" to store the RFM scores derived using the NTILE Window Function in SQL. Initially, a Common Table Expression (CTE) was used to obtain the essential columns for RFM calculations:
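A minimal sketch of such a CTE; the column names and the anchor date of 2014-12-31 (the dataset's last month) are assumptions:

```sql
-- Gather the three RFM inputs per customer
WITH rfm_base AS (
    SELECT customer_id,
           DATEDIFF(DAY, MAX(order_date), '2014-12-31') AS recency,   -- days since last order
           COUNT(DISTINCT order_id)                     AS frequency, -- number of orders
           SUM(sales)                                   AS monetary   -- total revenue
    FROM fact_sales_2
    GROUP BY customer_id
)
SELECT * FROM rfm_base;
```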
Subsequently, the NTILE window function was applied to rank the scores from 1 to 10. Notably, if I were to analyze this data in Excel, I would definitely use the PERCENTILE.INC function, which shares similarities with the NTILE function in SQL.
Finally, I aggregated the RFM values and applied the NTILE function again to establish an adjusted rank based on the combined RFM metrics.
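Putting both NTILE steps together, a simplified sketch looks like this; the sort directions are chosen so that a higher score is always better:

```sql
WITH rfm_base AS (
    SELECT customer_id,
           DATEDIFF(DAY, MAX(order_date), '2014-12-31') AS recency,
           COUNT(DISTINCT order_id)                     AS frequency,
           SUM(sales)                                   AS monetary
    FROM fact_sales_2
    GROUP BY customer_id
),
rfm_scores AS (
    SELECT customer_id,
           -- NTILE(10) splits customers into ten equal buckets, scored 1-10
           NTILE(10) OVER (ORDER BY recency DESC) AS r_score,  -- most recent buyers score highest
           NTILE(10) OVER (ORDER BY frequency)    AS f_score,
           NTILE(10) OVER (ORDER BY monetary)     AS m_score
    FROM rfm_base
)
SELECT customer_id, r_score, f_score, m_score,
       -- rank the combined RFM metrics with NTILE once more
       NTILE(10) OVER (ORDER BY r_score + f_score + m_score) AS rfm_rank
FROM rfm_scores;
```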
After establishing the SQL views, I connected them to Power BI for dashboard development. I prioritized creating the data model, which is the first and most fundamental step in the dashboard development process.
Below, you can find a screenshot of my measure table (blue box), which I established to organize the various measures systematically. Additionally, I created distinct display folders for categorizing the different types of measures. Employing such practices ultimately helps mitigate future errors and makes debugging more straightforward.
After that, I created a Calendar Table using DAX. While I am aware of the CALENDARAUTO function, in this case, I am certain about the specific columns and date range that I already explored and took notes on during the Exploratory Data Analysis. That's why I chose to create the Calendar Table this way.
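A sketch of such an explicit calendar; the 2011-2014 range is an assumption based on the dataset's timeframe, and the extra columns are illustrative:

```dax
Calendar =
ADDCOLUMNS (
    -- explicit range instead of CALENDARAUTO, since the dates are known from EDA
    CALENDAR ( DATE ( 2011, 1, 1 ), DATE ( 2014, 12, 31 ) ),
    "Year", YEAR ( [Date] ),
    "Month Number", MONTH ( [Date] ),
    "Month", FORMAT ( [Date], "mmm" ),
    "Quarter", "Q" & ROUNDUP ( MONTH ( [Date] ) / 3, 0 )
)
```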
Following that, I created a key column for the analysis, which is a calculated column for RFM using DAX (Data Analysis Expressions). I genuinely appreciate the SWITCH function in DAX because it saves a significant amount of time compared to writing lengthy Nested IF Statements.
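A simplified sketch of that calculated column; the segment names match the dashboard, but the rank cut-offs here are illustrative placeholders:

```dax
RFM Segment =
SWITCH (
    TRUE (),                                        -- evaluates the first condition that is true
    dim_customer[rfm_rank] >= 9, "Top Customers",
    dim_customer[rfm_rank] >= 7, "Loyal Customers",
    dim_customer[rfm_rank] >= 4, "At Risk",
    "High Churn Risk"                               -- fallback for everyone else
)
```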
Subsequently, I created this measure to identify active customers. My condition is that if a customer has not been active for 1 year, I consider them as churned or inactive. The maximum date in the dataset is December 2014, so customers who didn't make any purchases in 2014 are categorized as inactive.
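A sketch of such a measure under those assumptions (table and column names are illustrative):

```dax
Active Customers =
CALCULATE (
    DISTINCTCOUNT ( fact_sales_2[customer_id] ),
    -- keep only customers whose most recent order falls in 2014;
    -- everyone else is considered churned/inactive
    FILTER (
        VALUES ( fact_sales_2[customer_id] ),
        CALCULATE ( MAX ( fact_sales_2[order_date] ) ) >= DATE ( 2014, 1, 1 )
    )
)
```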
For my final step, I devised a systematic process: identify the essential columns for the analysis, then modify the SQL database's views so that only those columns are loaded. This approach keeps the Power BI model lean and optimized, ensuring the best possible outcomes for the analysis and visualization work.
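For example, a trimmed view might look like this (a sketch with assumed column names, assuming SQL Server):

```sql
-- Expose only the columns the report actually needs
CREATE OR ALTER VIEW vw_fact_sales AS
SELECT order_id,
       order_date,
       customer_id,
       new_product_id,
       sales,
       quantity,
       discount,
       profit
FROM fact_sales_2;
```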
Insight 1: I found that customers in the "High Churn Risk" and "At Risk" segments are not receiving enough discounts. This could be a contributing factor to their churn risk, and it matters because these two segments together contain more customers than the "Top Customers" segment and about as many as the "Loyal Customers" segment.
Insight 2: Purchase Frequency is not correlated with profit margin; only customers with a Purchase Frequency between 40 and 70 are generating high profits.