
Customer Churn Analysis Using SQL, Python and Tableau

About this project

Introduction



Dollar Bank was concerned that more and more customers were leaving its credit card services. They asked for a seasoned Data Analyst to analyze the problem in order to understand the main reasons customers were leaving. They also needed recommendations for how the bank could mitigate further customer churn. Ultimately, the bank wanted to proactively implement these recommendations to keep its customers happy.


TLDR: Summary

The Customer Churn Analysis End-to-End Project for Dollar Bank aimed to understand why customers were leaving its credit card services and to recommend ways to mitigate churn. SQL queries were used to explore the data, and Python-based analysis examined variables, summary statistics, outliers, and data quality. The Tableau dashboard highlighted the importance of customer demographics, account information, and transaction behavior in predicting and preventing churn. A PostgreSQL database was set up on AWS RDS, following specific configuration steps, to simulate real-life work scenarios where data is housed in some type of data architecture rather than in CSV files.

Data ingestion involved creating tables and importing data using pgAdmin4. Exploratory Data Analysis (EDA) was conducted through SQL queries, while deep-dive analysis in Python included summary statistics, data cleaning, distribution analysis, and cross-correlation analysis. Various data questions were raised and answered, such as understanding product counts, credit card categories, and transaction changes. The findings were visualized in an interactive Tableau dashboard.

Recommendations suggested focusing on retaining established customers, particularly female blue cardholders earning less than $40k, and reducing the overall churn rate. Strategies such as targeted marketing campaigns, financial education for graduates, brand partnerships, loyalty programs, personalized services, credit counseling, and rewards and incentives were recommended to mitigate churn and improve customer retention.

Additionally, Pareto analysis was performed to identify the vital few customer segments that contributed the most to churn. The analysis revealed that focusing on approximately 20% of the churned customers could significantly reduce churn. By increasing the focus on these vital few segments to around 30% of the total churned customers, Dollar Bank could effectively capture more customer segments and further reduce churn. This approach allowed the bank to prioritize its efforts and resources towards retaining the customers who had the highest impact on churn.


Link to Interactive Tableau Dashboard


Methodology

SQL queries were used to explore and understand the data, and joining all 3 datasets was critical in answering research questions to generate valuable insights for the business.

Python-based deep-dive analysis examined the data variables, their datatypes, summary statistics, outliers, data quality, and tidiness issues. This process also included a distribution analysis of each variable in the dataset to identify valuable business insights, followed by a cross-correlation analysis to study relationships between variables and identify variables of interest that could help raise data questions to solve the business problem at hand.

The Tableau dashboard highlighted the importance of understanding customer demographics, account information, and transaction behavior in predicting and preventing customer churn. Click the link below to use the dashboard.


Database Creation

I set up a relational database to simulate real-life work scenarios, where data would typically be housed in some type of data architecture rather than in CSV files, unless the size and complexity of the company's data needs are quite simple.

To create and set up the PostgreSQL database on AWS RDS (Free Tier), I followed these steps:

  1. Logged into my AWS account (if you don't have one already, create an AWS account at https://aws.amazon.com/).

  2. Accessed AWS Management Console and navigated to the RDS (Relational Database Service) section.

  3. Created a new RDS instance by clicking "Create database" or "Create instance" in the RDS dashboard, choosing the "Amazon RDS" option, and then selecting "PostgreSQL" as the database engine.


  4. Configured instance settings by specifying the following:
  • DB engine version: Selected PostgreSQL version 14.6

  • Instance identifier: Created a unique name for my PostgreSQL instance.

  • Master username: Created an admin username for my database.

  • Master password: Created a secure password for the admin account.


  5. Configured DB instance size: Selected the instance class (CPU, memory, and storage) based on my requirements. For this project my workload is small, so I opted for the db.t3.micro with 1 GiB of memory and 2 vCPUs, a Burstable class that provides a baseline level of CPU performance with the ability to burst above the baseline.

  6. Storage settings: Selected the storage type (SSD or magnetic) and specified the allocated storage size. I opted for the General Purpose SSD (gp2) with 200 GiB, just in case I decided to enrich the database with more variables that could help predict customer churn, or to update the database with recent customer activity.

  7. Configured the network and security settings: Because it's a free account, I only had access to the default VPC and security group. This configuration is what grants me access to the PostgreSQL instance.

  8. Set up database options: Entered the DB parameter group, configured automatic backups, and set the backup retention period to the default of 7 days.

  9. Configured maintenance settings: Selected a maintenance window and opted in for automatic minor version upgrades.

  10. Launched the instance: Reviewed my settings and clicked "Create database" or "Launch DB instance."

  11. Connected to the instance: Once the instance was available, I noted the endpoint and port number in the instance details, then used these credentials to connect to the DB instance using a PostgreSQL client or application. For simplicity, I used pgAdmin4 locally to connect to my RDS DB instance and created the 'dollarbank' database.

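The console steps above can also be scripted. Below is a minimal sketch using boto3 (the AWS SDK for Python) that mirrors the same configuration; the region, instance identifier, and credentials are placeholder values, and the call assumes AWS credentials are already configured locally.

import boto3

# Sketch only: region, identifier and credentials below are placeholders
rds = boto3.client("rds", region_name="us-east-1")

response = rds.create_db_instance(
    DBInstanceIdentifier="dollarbank-postgres",  # unique instance name (step 4)
    Engine="postgres",
    EngineVersion="14.6",                        # engine version (step 4)
    DBInstanceClass="db.t3.micro",               # burstable class (step 5)
    AllocatedStorage=200,                        # 200 GiB (step 6)
    StorageType="gp2",                           # General Purpose SSD (step 6)
    MasterUsername="db_admin",                   # placeholder admin username
    MasterUserPassword="replace-with-a-secret",  # placeholder password
    BackupRetentionPeriod=7,                     # default retention (step 8)
    AutoMinorVersionUpgrade=True,                # maintenance setting (step 9)
    PubliclyAccessible=True,                     # so pgAdmin4 can reach it locally
)
print(response["DBInstance"]["DBInstanceStatus"])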


Useful Resource for this Step: [Medium Article] "Using RDS on AWS with Jupyter Notebooks" by Samantha Jackson (@sjacks) on medium.com


Data Ingestion

In pgAdmin4, I used DDL statements to create the 3 tables I needed for this project.


Below is the Entity Relationship Diagram (ERD) for the PostgreSQL database.

*Dollar Bank Entity Relationship Diagram (ERD)*

After creating all 3 tables and the Entity Relationship Diagram (ERD), the next step was to ingest/import data into the tables. For simplicity, I also used pgAdmin4 to import the data from the respective CSV files I had downloaded locally.

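As a scripted alternative to the pgAdmin4 import wizard, a small pandas loop could read each CSV and append it into the tables created by the DDL statements. This is only a sketch: the file paths and connection credentials are placeholders.

# Sketch of a scripted alternative to the pgAdmin4 import wizard.
# File paths and credentials are placeholders for illustration.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine(
    "postgresql+psycopg2://USER:PASSWORD@HOST:5432/dollarbank"  # placeholder credentials
)

# Map each target table to its local CSV file (paths are hypothetical)
csv_to_table = {
    "bank_churners": "bank_churners.csv",
    "basic_client_info": "basic_client_info.csv",
    "enriched_churn_data": "enriched_churn_data.csv",
}

for table, path in csv_to_table.items():
    df = pd.read_csv(path)
    # Lower-case column names to match the DDL definitions
    df.columns = [c.lower() for c in df.columns]
    # Append into the existing tables created by the DDL statements
    df.to_sql(table, engine, if_exists="append", index=False)
    print(f"Loaded {len(df)} rows into {table}")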


Data Description

For this task, three datasets are provided:

  1. bank_churners.csv - this file contains basic information about each client (10 columns). The columns are:

    • CLIENTNUM - Client number. Unique identifier for the customer holding the account
    • Attrition_Flag - Internal event (customer activity) variable - whether the client has churned (attrited) or not (existing)
    • Dependent_Count - Demographic variable - Number of dependents
    • Card_Category - Product Variable - Type of Card (Blue, Silver, Gold, Platinum)
    • Months_on_book - Period of relationship with the bank
    • Months_Inactive_12_mon - No. of months inactive in the last 12 months
    • Contacts_Count_12_mon - No. of Contacts in the last 12 months
    • Credit_Limit - Credit Limit on the Credit Card
    • Avg_Open_To_Buy - Open to Buy Credit Line (Average of last 12 months)
    • Avg_Utilization_Ratio - Average Card Utilization Ratio
  2. basic_client_info.csv - this file contains basic demographic information about each client (6 columns)

    • CLIENTNUM - Client number. Unique identifier for the customer holding the account
    • Customer Age - Demographic variable - Customer's Age in Years
    • Gender - Demographic variable - M=Male, F=Female
    • Education_Level - Demographic variable - Educational qualification of the account holder (example: high school, college graduate, etc.)
    • Marital_Status - Demographic variable - Married, Single, Divorced, Unknown
    • Income_Category - Demographic variable - Annual Income Category of the account holder (< $40K, $40K - $60K, $60K - $80K, $80K - $120K, > $120K, Unknown)
  3. enriched_churn_data.csv - this file contains some enriched data about each client (7 columns)

    • CLIENTNUM - Client number. Unique identifier for the customer holding the account
    • Total_Relationship_Count - Total no. of products held by the customer
    • Total_Revolving_Bal - Total Revolving Balance on the Credit Card
    • Total_Amt_Chng_Q4_Q1 - Change in Transaction Amount (Q4 over Q1)
    • Total_Trans_Amt - Total Transaction Amount (Last 12 months)
    • Total_Trans_Ct - Total Transaction Count (Last 12 months)
    • Total_Ct_Chng_Q4_Q1 - Change in Transaction Count (Q4 over Q1)


SQL tasks


*Dollar Bank Entity Relationship Diagram (ERD)*


After a thorough discussion with relevant business units and project stakeholders, business questions were translated into simple & advanced SQL queries to provide answers accordingly.

Below are some of the business questions asked and the answers derived using SQL queries.


Question:

Let's define a new variable called age_group:

  • 10 < x ≤ 30
  • 30 < x ≤ 40
  • 40 < x ≤ 50
  • 50 < x ≤ 60
  • 60 < x ≤ 120

For each age_group, marital_status and income_category, find the following values:

  • a. Churn_rate (in %)
  • b. Average Total_Relationship_Count
  • c. Minimum value of Total_Amt_Chng_Q4_Q1
  • d. Count of customers

(Make sure to order the data by the number of customers in descending order)


Solution:

-- Let's define a new variable called age_group:

--10 < x ≤ 30
--30 < x ≤ 40
--40 < x ≤ 50
--50 < x ≤ 60
--60 < x ≤ 120

-- Rather than permanently updating the table, I am including this new variable in a view. This way I can easily refer to it without recreating it every time it's needed.

DROP VIEW IF EXISTS demographics;
CREATE VIEW demographics AS (

    SELECT 
        clientnum,
        CASE WHEN customer_age > 10 and customer_age <= 30 THEN '11 - 30' 
        WHEN customer_age > 30 and customer_age <= 40 THEN '31 - 40'
        WHEN customer_age > 40 and customer_age <= 50 THEN '41 - 50'
        WHEN customer_age > 50 and customer_age <= 60 THEN '51 - 60'
        WHEN customer_age > 60 and customer_age <= 120 THEN '61 - 120'
        END AS age_group,
        marital_status,
        income_category
    FROM basic_client_info
    )
;

-- Solution approach: Create a pivot-style summary that groups the dataset by the demographic variables (age group, marital status and income category), aggregating each group into its churn rate, average total relationship count, minimum total amount change (Q4 over Q1) and number of customers.

WITH churned AS (
    
    SELECT 
        dem.clientnum AS clientnum,
        age_group,
        marital_status,
        income_category,
        total_relationship_count,
        total_amt_chng_q4_q1,
        CASE WHEN attrition_flag = 'Attrited Customer' THEN 1
        ELSE 0 END AS is_churned
    FROM demographics AS dem
    JOIN bank_churners AS bc
        ON dem.clientnum = bc.clientnum
    JOIN enriched_churn_data AS ecd
        ON dem.clientnum = ecd.clientnum
    )

SELECT 
    age_group, 
    marital_status, 
    income_category,
    ROUND(100 * SUM(is_churned)::numeric / COUNT(*), 1) AS churn_rate_percent, -- churn rate within each demographic group
    ROUND(AVG(total_relationship_count)) as avg_total_product_count,
    MIN(total_amt_chng_q4_q1) as min_amt_chng_q4_q1,
    COUNT(clientnum) as client_count
FROM churned
GROUP BY 1, 2, 3
ORDER BY client_count DESC
;


Question:

We're interested in knowing which client (CLIENTNUM) has the 2nd highest Total_Trans_Amt, for each Marital_Status.

Solution:

-- Which client (CLIENTNUM) has the 2nd highest Total_Trans_Amt, for each Marital_Status.

WITH t1 AS (
    SELECT
        bci.clientnum,
        marital_status,
        total_trans_amt,
        DENSE_RANK() OVER (PARTITION BY marital_status ORDER BY total_trans_amt DESC) AS rnk
    FROM basic_client_info AS bci
    JOIN enriched_churn_data AS ecd
        ON bci.clientnum = ecd.clientnum
)
SELECT 
    marital_status,
    clientnum AS client_with_2nd_highest_trans_amt
FROM t1
WHERE rnk=2
;



You can see more SQL tasks via the link below:

Link to Notebook for SQL Tasks


Python Deep Dive Analysis


Database Integration

Import all the libraries required for this project, and connect to the PostgreSQL database hosted on AWS RDS.

In [1]:

# Import numpy library for performing scientific computing as well as various mathematical operations.
import numpy as np

# Import pandas library for loading tables from the DB into dataframes to perform data wrangling & analysis.
import pandas as pd

# Import the matplotlib.pyplot and seaborn libraries for creating data visualizations and plots.
import matplotlib.pyplot as plt
import seaborn as sns

# Import psycopg2 library that will connect to the database
import psycopg2

# Import the config.py file and alias as creds. The config.py file contains all credentials for the postgres DB hosted by AWS.
import config as creds

# Import the create_engine and text functions from sqlalchemy
from sqlalchemy import create_engine, text

# Use sqlalchemy's create_engine function to create a dialect object tailored to postgres and assign the connection string to engine
# i.e. engine = create_engine(dialect+driver://username:password@host:port/database)

try: # Using a try-except block to handle any exceptions that might occur during the connection process
    engine = create_engine(f"postgresql+psycopg2://{creds.PGUSER}:{creds.PGPASSWORD}@{creds.PGHOST}:{creds.PGPORT}/{creds.PGDATABASE}")
    # Test the connection by executing a simple query. If the connection is successful, the first print statement will be executed
    with engine.connect() as connection:
        connection.execute(text("SELECT 1"))  # wrapping in text() keeps this compatible with SQLAlchemy 1.4+ and 2.0
    print(f"Successfully connected to the {creds.PGDATABASE} database!")
except Exception as e: # If unsuccessful, an error message will be printed
    print(f"Error: Failed to connect to the {creds.PGDATABASE} database!")
    print(f"Exception details: {e}")

Successfully connected to the dollarbankdb database!


Load data into pandas dataframes

In [2]:

# Load all 3 tables from the dollarbank DB into 3 respective pandas dataframes, using engine (the connection
# string we created) as an argument to pd.read_sql_table(), the pandas method that reads a table from a SQL DB

table1 = 'bank_churners'
table2 = 'basic_client_info'
table3 = 'enriched_churn_data'

bank_churners_df = pd.read_sql_table(table1, engine)
basic_client_info_df = pd.read_sql_table(table2, engine)
enriched_churn_df = pd.read_sql_table(table3, engine)


Resources Used

  • "Using RDS on AWS with Jupyter Notebooks - Creating, Connecting, and Querying a PostgreSQL Database on AWS" by Samantha Jackson. Link

  • SQLAlchemy documentation: Link


Now let's see the first 3 rows of each dataframe we have

In [3]:

# Displaying first 3 rows of bank_churners_df
bank_churners_df.head(3)


# Displaying first 3 rows of basic_client_info_df
basic_client_info_df.head(3)


# Displaying first 3 rows of enriched_churn_df
enriched_churn_df.head(3)


Summary Statistics Short Analysis

I checked the characteristics of the datasets including basic statistics, central tendency, count of unique values in each column, whether or not there are missing values and any outliers in numerical columns.

The next step was to assess important information about the variability and skewness of the datasets, after dealing with outliers if any were present.

To avoid too much information being presented here in the portfolio, I have not included the outcome of this step. Have a look at the Python notebook to see what was done.

Some potential outliers were initially detected, but after reading the column definitions and comparing the lower and upper limits for each column to the min & max values, it was concluded that the high & low values initially detected were not actually outliers.
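For reference, the lower and upper limits mentioned above typically come from the 1.5 × IQR rule. Here is a minimal sketch of that check, using credit_limit from bank_churners_df as an illustrative column:

# Sketch of the 1.5*IQR outlier check, using credit_limit as an example column
q1 = bank_churners_df['credit_limit'].quantile(0.25)
q3 = bank_churners_df['credit_limit'].quantile(0.75)
iqr = q3 - q1
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Compare the limits to the observed min & max before declaring outliers
print(f"limits: [{lower_limit:.1f}, {upper_limit:.1f}]")
print(f"observed: [{bank_churners_df['credit_limit'].min():.1f}, {bank_churners_df['credit_limit'].max():.1f}]")

flagged = bank_churners_df.query('credit_limit < @lower_limit or credit_limit > @upper_limit')
print(f"{len(flagged)} values fall outside the limits")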

Link to the Python notebook for Summary Statistics


Data Cleaning

In the summary statistics analysis, pandas DataFrame methods were used to assess dataset properties: df.info() provided summary information, df.describe() offered descriptive statistics, and df.value_counts() counted unique values. I performed outlier detection and validated the data to check whether the identified values were actually outliers.

Additionally, I needed to convert the 'clientnum' variable from an integer datatype to a string/object datatype so that it would not be treated as a quantity in numerical analysis tasks. I used the define-code-test framework to perform this cleaning task.

Note: Storing unique IDs as integers in SQL databases is good practice for performance and efficiency reasons (integers take up less space in memory than strings, leading to faster joins and more efficient processing). I converted this variable to strings only for the sake of numerical analyses such as cross-correlation analysis, descriptive stats and answering data questions, where an identifier should not be treated as a number.


In [18]:

# Making copies of all 3 datasets
df1 = bank_churners_df.copy()
df2 = basic_client_info_df.copy()
df3 = enriched_churn_df.copy()

# And then joining the resulting dataframes into one master dataset for easier access and manipulation
df = df1.merge(df2, on='clientnum').merge(df3, on='clientnum')

# Checking to make sure all 3 dataframes and their respective columns were joined correctly
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10127 entries, 0 to 10126
Data columns (total 21 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   clientnum                 10127 non-null  int64  
 1   attrition_flag            10127 non-null  object 
 2   dependent_count           10127 non-null  int64  
 3   card_category             10127 non-null  object 
 4   months_on_book            10127 non-null  int64  
 5   months_inactive_12_mon    10127 non-null  int64  
 6   contacts_count_12_mon     10127 non-null  int64  
 7   credit_limit              10127 non-null  float64
 8   avg_open_to_buy           10127 non-null  float64
 9   avg_utilization_ratio     10127 non-null  float64
 10  customer_age              10127 non-null  int64  
 11  gender                    10127 non-null  object 
 12  education_level           10127 non-null  object 
 13  marital_status            10127 non-null  object 
 14  income_category           10127 non-null  object 
 15  total_relationship_count  10127 non-null  int64  
 16  total_revolving_bal       10127 non-null  int64  
 17  total_amt_chng_q4_q1      10127 non-null  float64
 18  total_trans_amt           10127 non-null  int64  
 19  total_ct_chng_q4_q1       10127 non-null  float64
 20  total_trans_ct            10127 non-null  int64  
dtypes: float64(5), int64(10), object(6)
memory usage: 1.7+ MB

Issue to be Cleaned

Define

Change the datatype of the clientnum variable from integer to string/object

Code

In [19]:

df = df.astype({'clientnum': 'object'})

Test

In [22]:

df.clientnum.info()

<class 'pandas.core.series.Series'>
Int64Index: 10127 entries, 0 to 10126
Series name: clientnum
Non-Null Count  Dtype 
--------------  ----- 
10127 non-null  object
dtypes: object(1)
memory usage: 158.2+ KB


Distribution Analysis for each of the columns in the dataset

Examined the value distribution of each column/variable. Histograms were used for continuous variables with many unique values, while bar charts were employed for variables with fewer than 10 unique values.
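That rule can be applied mechanically across the merged dataframe; a short sketch (assuming the df produced in the cleaning step):

# Sketch: bar chart for low-cardinality columns, histogram for (numeric) high-cardinality ones
import matplotlib.pyplot as plt

for col in df.columns.drop('clientnum'):  # skip the unique identifier
    plt.figure(figsize=(6, 3))
    if df[col].nunique() < 10:
        # Few unique values: a bar chart of value counts reads better
        df[col].value_counts().plot(kind='bar', title=col)
    else:
        # Many unique values: use a histogram of the distribution
        df[col].plot(kind='hist', bins=30, title=col)
    plt.tight_layout()
    plt.show()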

Go to the Python notebook to view the distribution analysis.

Link to Notebook for Distribution Analysis


Cross-Correlation Analysis

By cross-examining columns against each other, more insights and broader questions can be asked and answered. So let's see how variables are correlated to each other.

In [10]:

# Using correlation heatmaps to display the correlation between variables
# Importing matplotlib.pyplot and seaborn libraries
import matplotlib.pyplot as plt, seaborn as sns

# Set the figure size, and plot a heatmap displaying correlations between columns in the df using the "rocket_r" color map
fig, ax = plt.subplots(figsize=(14, 6))
sns.heatmap(df.corr(), vmin=-1, vmax=1, annot=True, cmap="rocket_r")

# Rotate the x-axis tick labels by 75 degrees and show the plot
plt.xticks(rotation=75)
plt.show()


Raising Data Questions

After performing the distribution and cross-correlation analyses, I had a better understanding of the bank's dataset and could raise data questions to find clues and answers to help in my investigation. Grouping by the attrition flag or by categorical demographic columns supports bar charts, pie charts, text tables and even tree maps built on the columns that hold numerical data about the credit card usage patterns and repayment habits of the bank's clients, in order to find reasons for customers churning and to recommend strategies for keeping existing customers from churning.

Some data questions that were raised included:

  • How does the total relationship count (i.e. the number of products customers have) differ between age groups, genders, income categories or any other demographic? A bar chart of total relationship count (numerical) against a demographic categorical column can answer this question (see the sketch after this list).
  • On average, how many of the bank's products do customers typically have? Who are the top or bottom customers by number of the bank's products they have? A histogram plot of the total_relationship_count and a text table showing top and bottom customers of the bank by their total product count, can be used to answer such questions.
  • Which is the most popular credit card category? What type of credit card holders have churned the most? What is the credit limit according to the different credit card categories? How much credit is currently available for a customer to use based on the type of credit card they have? A histogram of the card_category, bar chart of card_category vs attrition_flag, text table of average credit_limit grouped by card_category, and a text table of avg_open_to_buy grouped by card_category can answer such questions.
  • Does the number of dependents per customer affect their credit card needs or usage? Is there a relationship between dependent count and the type of credit card/credit limit a customer has? Plotting a graph of dependent count vs avg utilization ratio or total_trans_ct can answer the first question. A bar chart of dependent count vs credit limit or card category vs dependent count can answer the second question.
  • Who are the top (10 or 20) and bottom (10 or 20) clients based on % change in transaction count or % change in transaction amount from Q1 to Q4? Are there any significant differences in % change for churned customers compared to existing customers? Are there any observed patterns when plotting these % change columns grouped by demographic columns? The total_amt_chng_q4_q1 and total_ct_chng_q4_q1 columns which are both numerical columns can be used to create a text table of top or bottom (10 or 20) clients based on their unique client numbers (clientnum column). Bar charts can also be used to answer these questions by grouping demographic columns like gender, marital status, education or income category.
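To illustrate the first question in the list above, here is a sketch of one such demographic comparison, using gender as an assumed example:

# Sketch for the first data question: average product count by a demographic column (gender as an example)
import matplotlib.pyplot as plt

view = df.groupby('gender')['total_relationship_count'].mean()

plt.figure(figsize=[6, 4])
ax = view.plot(kind='bar', xlabel='Gender', ylabel='Avg Total Relationship Count', legend=False)
ax.bar_label(ax.containers[0], fmt='%.2f', label_type='edge')
plt.show()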

From this long list, I selected the most investigative questions to find reasons why customers were churning, along with other insights that might have a direct or indirect connection to the main problem.


Question : What type of credit card holders have churned the most? How about Customer Retention?

In [44]:

import numpy as np

result1 = df.groupby('card_category')['clientnum'].count().reset_index(name='total_customer_count')
result2 = df.groupby('card_category')['attrition_flag'].apply(lambda x: (x == 'Attrited Customer').sum()).reset_index(name='churned_count')

text_table = result1.merge(result2, on='card_category')

text_table['overall_churn_rate_pct'] = np.round(text_table['churned_count'] / len(df) * 100, 2)

text_table['category_churn_rate_pct'] = np.round(text_table['churned_count'] / text_table['total_customer_count'] * 100, 1)

text_table['category_retention_ratio'] = np.round(text_table['total_customer_count'] / text_table['churned_count'], 1)

text_table.sort_values(by='overall_churn_rate_pct', ascending=False)


Explanation:

The overall churn rate % shows that the bank's Blue credit card holders churned the most, with a 15% churn rate. For Silver, Gold and Platinum credit card customers, the overall churn rate is less than 1%.

The category churn rate % and category retention ratio variables are used here to highlight the likelihood of churn and to compare the initial customer base to the churned customers within each category. This is because the number of customers with Blue credit cards is considerably larger than the other three card categories combined. A higher category retention ratio indicates a lower proportion of churned customers relative to the initial customer base, which is generally a positive sign for customer retention within that category. Conversely, a lower ratio could indicate a retention issue that might need to be addressed for that specific category.

Viewed this way, we suddenly see that 1 in 4 Platinum credit card holders churned, roughly 1 in 5 Gold customers, 1 in 6 Blue customers, and roughly 1 in 7 Silver customers. From this perspective, there appears to be a potential issue with the retention of Platinum credit card customers: out of only 20 total customers with a Platinum credit card, 5 have churned.

Could this be related to the income category of customers in certain card categories? Or could there be something else happening? Let us investigate further.


Question : Was there a relationship between credit card category, income category and the usage of credit cards by customers?

# Grouping by card category & income category to see card utilization trends
view = df.groupby(['card_category', 'income_category'])['avg_utilization_ratio'].mean().sort_values(ascending=False)

# Using a suitable figure/chart size for the plot
plt.figure(figsize=[10, 5])
ax = view.plot(kind='bar', xlabel="\nCard Category, Income Category", ylabel="Avg Utilization Ratio", legend=False)

ax.bar_label(ax.containers[0], fmt='%.2f', label_type='edge')

plt.show();


Explanation:

The plot above shows that, on average, Blue credit card customers who earn less than $40k and those who earn $40k-$60k tend to use their credit cards the most, while Platinum credit card customers who earn $40k-$60k and those whose income category is unknown tend to use their credit cards the least.

On average, Blue credit card customers in all income categories tend to use their cards far more than Silver, Gold and Platinum credit card customers. Average usage among Silver, Gold and Platinum customers earning more than $60k does not differ much, indicating that these customers are not heavily dependent on their credit cards; it may also suggest they are less likely to be interested in trying other credit products/services the bank offers.

It is worth investigating further to see how factors such as dependent count affect the credit card needs of customers.


Question : How did income category & number of dependents per customer affect credit card needs or usage?

# Grouping by dependent count & income category to see card utilization trends
view = df.groupby(['dependent_count', 'income_category'])['avg_utilization_ratio'].mean().sort_values(ascending=False)

# Using a suitable figure/chart size for the plot
plt.figure(figsize=[12, 5])
ax = view.plot(kind='bar', xlabel="\nDependent Count, Income Category", ylabel="Avg Utilization Ratio", legend=False)

# Bar labels are left out here: with this many bars they would clutter the plot
#ax.bar_label(ax.containers[0], fmt='%.2f', label_type='edge')

plt.show();


Explanation:

There was no clear correlation between credit card utilization and number of dependents, but when income category was grouped with dependent count, it yielded some interesting insights.

On average, customers who earned less than $40k used their credit cards the most, regardless of their number of dependents. Customers who earned $40k-$60k and those who earned $60k-$80k did not differ significantly in their card usage rates, regardless of dependent count. But as earnings increased further, credit card utilization decreased as the number of dependents increased, especially for customers earning $120k or more.

In general, card utilization decreased as income increased, suggesting that higher-earning customers tended to use their credit cards less.


Question : Which age groups utilized their credit cards the most and the least? And which age groups had the most churned customers?

# Creating a new column to group customers according to their age groups
import numpy as np

conditions = [
    (df['customer_age'] > 10) & (df['customer_age'] <= 30),
    (df['customer_age'] > 30) & (df['customer_age'] <= 40),
    (df['customer_age'] > 40) & (df['customer_age'] <= 50),
    (df['customer_age'] > 50) & (df['customer_age'] <= 60),
    (df['customer_age'] > 60) & (df['customer_age'] <= 120)
    ]

values = ['11 - 30', '31 - 40', '41 - 50', '51 - 60', '61 - 120']
    
df['age_group'] = np.select(conditions, values)



first_view = df.groupby('age_group')['avg_utilization_ratio'].mean().round(2).sort_values(ascending=False)

second_view = df.groupby(['age_group'])['attrition_flag'].apply(lambda x: (x == 'Attrited Customer').sum()).sort_values(ascending=False)

# Resize the chart, and have two plots side-by-side
# Set a larger figure size for subplots
plt.figure(figsize = [12, 5]) 

# 1 row, 2 cols, subplot 1
plt.subplot(1, 2, 1)
# Plot a bar chart with the data
ax = first_view.plot(kind='bar', xlabel="\nAge Group" , ylabel= "Avg Utilization Ratio", legend=False)
# Include the bar labels
ax.bar_label(ax.containers[0], label_type= 'edge')
# Rotate the x ticks
plt.xticks(rotation=10)

# 1 row, 2 cols, subplot 2
plt.subplot(1, 2, 2)
# Plot a bar chart with the data
ax = second_view.plot(kind='bar', xlabel="\nAge Group" , ylabel= "Count of Churned Customers", legend=False)
# Include the bar labels
ax.bar_label(ax.containers[0], label_type= 'edge')
# Rotate the x ticks
plt.xticks(rotation=10)

# Now show the figure
plt.show();

Explanation:

On average, the difference in credit card utilization between age groups was small: customers in the 11-30 age group used their credit cards the most, while customers in the 41-50 age group had the lowest utilization ratio. However, the count of churned customers per age group adds context to this analysis. Customers aged between 41 and 50 churned the most, while customers between 11 and 30 churned the least. From this perspective, a big focus could be placed on customers in the 41-50 age group, and more data would have to be collected to enrich the dataset and provide deeper insights about this group in particular and all age groups in general.


Link to Notebook for All Python Tasks


Insights from Tableau Dashboard


Overall Churn Rate: The overall churn rate is 16%, which indicates that around one in six customers churned. This is a key metric for the bank to continuously monitor in its effort to mitigate churn and increase customer retention.

Transaction Amounts: The dashboard also shows that the average transaction amount for all customers is $4,404, which suggests that the bank's customer base includes customers with a range of spending levels. However, the average transaction amount for churned customers is $3,095, significantly lower than the average for all customers. This may indicate that customers who spend less on transactions are more likely to churn, perhaps because they are not as invested in the bank's services or do not find them as useful. The bank may want to consider strategies for encouraging customers to increase their transaction amounts in order to improve retention rates. For example, it could offer rewards programs or incentives for customers who maintain high transaction volumes or balances. Alternatively, it could improve its services or product offerings to make them more attractive to customers with lower transaction amounts.
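The figures quoted above come from the Tableau dashboard, but the same comparison can be reproduced on the merged dataframe; a short sketch:

# Sketch reproducing the dashboard's transaction-amount comparison in pandas
print(f"All customers: ${df['total_trans_amt'].mean():,.0f}")
print(df.groupby('attrition_flag')['total_trans_amt'].mean().round(0))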


Customer Demographics: Overall, the average customer age is 46 years, and there are more female customers than male customers, with the majority of customers in the 41-50 age range. The churn summary indicates that female clients aged 41-50 have the highest churn rate, with 4.4% of the 2,461 female clients in this age group having churned, compared to 3.3% of the 2,191 male clients in the same age group. Overall, the churn rate is higher for females at 9.2% across all age groups, compared to 6.9% for males.

According to the analysis, established customers (those who have been with the bank for 25 to 36 months) have the highest churn rate at 5% for females and 3.6% for males. The second-highest churn rate is seen in long-term customers (those with the bank for 37-48 months), with rates of 2.8% for females and 2.3% for males.

The analysis also shows that the largest number of churned customers were female Blue credit card holders earning less than $40k, with a total count of 559. The second-highest count, at 215, were male Blue credit card holders earning between $80k-$120k. However, a majority of customers across all income categories owned Blue credit cards. Both male and female customers who churned were established and long-term customers of the bank. Male Blue cardholders had churn counts ranging from 97 to 215 across the income categories from $40k to $120k+, with all churned male Blue cardholders being established and long-term customers of the bank.

This information can be used to tailor marketing and communication efforts to specific demographic groups to mitigate customer churn and increase customer retention.


Customer Behavior: Visualizing customer spending behavior and its connection to churn reveals that customers with a higher number of transactions and higher transaction amounts are less likely to churn. Additionally, the scatter plot shows that customers who churned made fewer than 100 transactions and typically spent less than $5k before churning, regardless of income category. This suggests that offering incentives for customers to maintain high levels of activity and transaction amounts may be an effective retention strategy.
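A matplotlib equivalent of that scatter view, colored by churn status, could be sketched as follows:

# Sketch of the transaction count vs. transaction amount scatter, colored by churn status
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8, 5))
for flag, color in [('Existing Customer', 'tab:blue'), ('Attrited Customer', 'tab:red')]:
    subset = df[df['attrition_flag'] == flag]
    ax.scatter(subset['total_trans_ct'], subset['total_trans_amt'], s=10, alpha=0.4, color=color, label=flag)
ax.set_xlabel('Total Transaction Count (last 12 months)')
ax.set_ylabel('Total Transaction Amount (last 12 months)')
ax.legend()
plt.show()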


Short-Term vs Long-Term Focus: Vital few - the Pareto analysis reveals that around 20% of the total churned customers are responsible for approximately 80% of the total churn, meaning a small subset of customers accounts for the majority of customer churn. Customer segments - the dashboard identifies two customer segments that constitute the vital few: female graduates who are married or single, and male graduates who are single. This suggests that Dollar Bank may need to pay special attention to these customer segments to reduce churn.
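The 80/20 cutoff itself can be verified outside Tableau. Here is a sketch of the Pareto computation, assuming churn segments defined by gender, education level and marital status as in the dashboard:

# Sketch of the Pareto (80/20) check over churned-customer segments,
# assuming segments are defined by gender, education level and marital status
churned = df[df['attrition_flag'] == 'Attrited Customer']
seg = (churned.groupby(['gender', 'education_level', 'marital_status'])
              .size()
              .sort_values(ascending=False))

cum_share = seg.cumsum() / seg.sum()      # cumulative share of total churn
vital_few = cum_share[cum_share <= 0.80]  # segments covering ~80% of churn

print(f"{len(vital_few)} of {len(seg)} segments cover ~80% of churn")
print(vital_few.head(10))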


Recommendations

The analysis highlights the need for the bank to focus on retaining its established and long-term customers in all income categories, especially female Blue cardholders earning less than $40k. By focusing on the vital few customer segments, Dollar Bank can reduce the overall churn rate to less than 7%, a generally acceptable level for most banks and credit card companies. The vital few of around 20% can be expanded to 30% of the total churned customers to capture more customer segments and further reduce churn.

Targeted marketing campaigns should be used to reach the customer segments contributing the most to churn. Examples of such campaigns include:

  • Providing financial education and resources to both male and female graduates, such as webinars, podcasts, or blog posts, to help them manage their finances more effectively, improve their financial literacy, and pay back student loans and other expenses without running into debt.

  • Partnering with brands that appeal to the target audience (e.g. partnering with fitness brands and offering discounts on gym memberships or workout gear to male and female graduates who are married or single).

  • Creating a loyalty program specifically for established and long-term customers, which could include perks such as waived fees, free financial planning sessions, or personalized investment advice.

  • Providing personalized investment advice and retirement planning services to established and long-term customers, which could help them maximize their savings and achieve their long-term financial goals.

  • Offering credit counseling services to Blue credit card holders who earn less than $40k per year, which could help them improve their credit scores and reduce their financial stress.

  • Offering rewards and incentives, e.g. sign-up bonuses, discounted interest rates on loans, or cash back on purchases, to increase credit card usage and attract new customers to the bank.


Data Enrichment

When enriching a customer churn dataset for a bank's credit card service, the following additional variables may be useful to collect:

Internal data points such as customer complaint history, complaint resolution times, and customer satisfaction ratings/comments/suggestions can support better customer churn and retention analyses.

External data points like industry trends and economic indicators such as the inflation rate, interest rates, the unemployment rate, consumer sentiment and changes in GDP. High inflation can reduce the purchasing power of consumers and make it more difficult for them to manage credit card debt. Changes in interest rates can affect the affordability of credit card debt and the attractiveness of credit card rewards programs; by benchmarking against the average industry interest rate, the bank can offer more competitive rates to retain existing clients and mitigate churn. High unemployment rates can affect consumer confidence and spending habits, which can impact credit card usage. Finally, consumer sentiment and changes in GDP greatly impact spending habits and credit card usage. These economic indicators can therefore provide valuable context for understanding customer behavior and predicting churn, while also identifying opportunities to improve customer retention and maximize profitability.

These external data points are readily available online through public and private, free and paid data sources.


Links

Link to Notebook for SQL & Python Tasks

Link to Tableau Dashboard
