Objective:
The primary goal of this project is to build a bank customer churn classification model that prioritises recall as the key performance metric. By maximising recall, the model aims to identify as many customers at risk of churning as possible, helping the bank take proactive measures to retain them.
Key Metrics:
- Recall: The primary focus is to ensure that the model correctly identifies a large proportion of customers likely to churn.
- Precision: While recall is the priority, precision is also considered to reduce the number of false positives (incorrectly predicting churn).
- F1 Score: Used to balance precision and recall in the evaluation process.
- Accuracy: Used to monitor overfitting by comparing training and test set performance. A significant gap between training and test accuracy indicates overfitting and therefore poor generalisation.
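As a minimal sketch of how these four metrics relate, the snippet below derives them from confusion-matrix counts in plain Python. The function names and the toy labels are illustrative assumptions, not the project's code or data; in practice scikit-learn's `recall_score`, `precision_score`, `f1_score` and `accuracy_score` compute the same quantities.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives, treating 1 as 'churn'."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred):
    """Return recall, precision, F1 and accuracy as a dict."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    recall = tp / (tp + fn) if (tp + fn) else 0.0        # churners caught
    precision = tp / (tp + fp) if (tp + fp) else 0.0     # churn alerts that were right
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / len(y_true)
    return {"recall": recall, "precision": precision,
            "f1": f1, "accuracy": accuracy}


def accuracy_gap(train_metrics, test_metrics):
    """Train minus test accuracy; a large positive gap suggests overfitting."""
    return train_metrics["accuracy"] - test_metrics["accuracy"]


# Toy labels for illustration only (not results from this project).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 1, 0, 0, 1]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Calling `classification_metrics` on both the training and test predictions and passing the two results to `accuracy_gap` gives the train/test comparison described above.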
Models Explored:
- Logistic Regression (with and without threshold tuning)
- Random Forest (with hyperparameter tuning)
- Gradient Boosting Machine (GBM) (with threshold tuning)
Data Preparation & EDA:
Data preparation and Exploratory Data Analysis (EDA) were completed in a previous phase. They provided key insights into the data, which helped in selecting relevant features and understanding the distribution of the target variable.
Methods Applied:
- Threshold Tuning: Applied to optimise recall across different models, adjusting the decision threshold to better capture customers at risk of churning.
- Feature Importance: Used to identify the most significant features impacting the model’s predictions, providing actionable insights for the business.
- SHAP Analysis: Implemented to enhance model interpretability by showing how individual features contribute to the prediction for each customer.
- Overfitting Evaluation: Training and test accuracy are compared to detect overfitting. A large gap prompts adjustments to hyperparameters and decision thresholds to improve generalisation.
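The threshold tuning step can be illustrated with a small sketch. It assumes a model has already produced churn probabilities for a held-out validation set, and it picks the highest threshold that still meets a target recall, so that precision is sacrificed as little as possible. The function names, the 0.75 recall target and the toy probabilities are hypothetical; the selection rule actually used in the project may differ.

```python
def recall_precision_at(y_true, y_prob, threshold):
    """Recall and precision when predicting churn for prob >= threshold."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision


def pick_threshold(y_true, y_prob, min_recall):
    """Return (threshold, recall, precision) for the highest threshold
    whose recall is at least min_recall, or None if none qualifies."""
    # Only the observed probabilities can change the predictions,
    # so they are the only candidate thresholds worth checking.
    for threshold in sorted(set(y_prob), reverse=True):
        recall, precision = recall_precision_at(y_true, y_prob, threshold)
        if recall >= min_recall:
            return threshold, recall, precision
    return None


# Toy validation data for illustration only (not results from this project).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_prob = [0.9, 0.8, 0.7, 0.4, 0.35, 0.3, 0.2, 0.1]

default_recall, _ = recall_precision_at(y_true, y_prob, 0.5)
threshold, recall, precision = pick_threshold(y_true, y_prob, min_recall=0.75)
print(f"default 0.5 threshold: recall={default_recall:.2f}")
print(f"tuned threshold {threshold}: recall={recall:.2f}, precision={precision:.2f}")
```

Choosing the threshold on a validation set rather than the test set keeps the final test metrics an honest estimate of how the tuned model generalises.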
Business Impact:
This model will help the bank:
- Identify at-risk customers before they churn, allowing timely interventions.
- Target retention efforts more effectively by focusing on key drivers of churn, as revealed by feature importance and SHAP analysis.
- Improve customer satisfaction and minimise revenue loss by reducing churn rates.
Tools & Technologies:
- Python: Used as the primary programming language for all data manipulation, model building, and evaluation tasks.
- Pandas: For data manipulation and DataFrame management.
- NumPy: For numerical operations, including array management.
- Scikit-learn: For implementing machine learning models (Logistic Regression, Random Forest, Gradient Boosting), cross-validation, and metrics calculations (precision, recall, F1, etc.).
- Matplotlib & Seaborn: For data visualisation, including confusion matrices and feature importance plots.
- SHAP: For model interpretability and analysing the contribution of each feature to model predictions.
- Jupyter Notebook: As the environment for running the analysis, visualising results, and documenting the project.
Final Model Selection:
After comparing the models with a focus on recall, the Gradient Boosting Machine (GBM) with threshold tuning was selected as the best-performing model. It achieved the highest recall while maintaining a reasonable balance with precision and F1 score and avoiding overfitting.