Link to the code: https://github.com/ignaciosoto04/Projects/blob/main/Credit_Score_Classification_v2.ipynb
OBJECTIVES & MEASUREMENT Main Objective
DATA SOURCES Data Set Introduction The Credit score classification dataset (Paris, 2022) was downloaded from Kaggle. It has 28 columns and 100,000 rows containing integer, float, and text values. There are 12,500 customers in the dataset, each with 8 entries, one per month from January to August.
Exclusions Each customer in the dataset has 8 entries, from January to August. After the data cleaning, I only consider the last entry for each customer, as it is the most recent.
DATA DICTIONARY Attached as an image.
DATA EXPLORATION Data Exploration Techniques The data exploration techniques used in this project were:
• head(): to check the first rows of the dataset.
• columns: to see the column names of the dataset.
• info(): to check data types and non-null counts.
• unique(): to see the unique values of each column.
• describe(): to get a first look at the column I was going to fix.
• plt.hist(): to see the distribution of the values (for numeric variables).
• plt.barh(): to see the distribution of the values (for object variables).
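The non-plotting calls in the list above can be sketched on a small toy frame. The column names and values here are illustrative assumptions, not rows from the Kaggle file, and the histogram and bar chart calls are left out to keep the sketch short.

```python
import pandas as pd

# Toy stand-in for the dataset; column names and values are assumptions.
df = pd.DataFrame({
    "Customer_ID": ["C1", "C1", "C2", "C2"],
    "Annual_Income": [35000.0, 35000.0, 72000.0, 72000.0],
    "Credit_Mix": ["Good", "Standard", "Standard", "Bad"],
})

print(df.head())            # first rows
print(df.columns.tolist())  # column names
df.info()                   # data types and non-null counts

# Unique values per column, in order of first appearance
unique_values = {col: df[col].unique().tolist() for col in df.columns}
print(unique_values)

# Summary statistics for the numeric columns
summary = df.describe()
print(summary)
```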
Data Cleansing The first step of the data cleaning was to delete the unnecessary columns: ID, Name, and SSN. The column “Type of Loan” was also removed because another column, 'Num_of_Loan', already counts how many loans each customer has. The next step was to clean each column individually. For numeric columns, describe() and plt.hist() were first used to get an initial look at the data distribution. Second, erroneous values such as “” were removed. Third, the outliers were replaced with NaN values. To fill the NaN values with an appropriate value, the rows were grouped by client (there are 8 rows for each client), and the statistical measure (mean, min, max, or mode) best suited to each column was chosen for the fill. Finally, the distribution was plotted again to confirm that it looked correct. For text columns, the same analysis was done. Incorrect values such as “” and “!@9#%8” were replaced with NaN values, which were then filled with the mode of each client. This ensured that each filled value came from the same client's own records.
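The per-client fill described above could look roughly like the following sketch. The column names (Customer_ID, Age, Occupation), the toy values, and the 0 to 120 plausibility range are assumptions for illustration; the mode is used here, although the project chose between mean, min, max, and mode column by column.

```python
import pandas as pd

# Toy data: two hypothetical clients with four monthly rows each.
df = pd.DataFrame({
    "Customer_ID": ["C1"] * 4 + ["C2"] * 4,
    "Age": [23, 23, 8000, 23, 41, -500, 41, 41],
    "Occupation": ["Engineer", "!@9#%8", "Engineer", "Engineer",
                   "Teacher", "Teacher", "!@9#%8", "Teacher"],
})

def fill_with_mode(s):
    """Fill missing values in one client's rows with that client's mode."""
    mode = s.mode()  # mode() ignores NaN by default
    return s.fillna(mode.iloc[0]) if not mode.empty else s

# Text column: turn the placeholder token into NaN, then fill per client.
df["Occupation"] = df["Occupation"].mask(df["Occupation"] == "!@9#%8")
df["Occupation"] = df.groupby("Customer_ID")["Occupation"].transform(fill_with_mode)

# Numeric column: values outside an assumed plausible range become NaN,
# then are filled per client in the same way.
df["Age"] = df["Age"].astype(float)
df["Age"] = df["Age"].where(df["Age"].between(0, 120))
df["Age"] = df.groupby("Customer_ID")["Age"].transform(fill_with_mode)

print(df)
```

Grouping before filling keeps one client's values from leaking into another client's rows, which a global mean or mode would not guarantee.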
DATA PREPARATION After the data cleaning, the 8 rows for each client had practically the same values. To avoid overfitting on these near-duplicate rows, 7 of the 8 rows were removed, keeping only the last row per customer so that the most recent values were used. The dataset went from 100,000 rows to 12,500.
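One way to keep only the most recent row per customer is sketched below. The column names (Customer_ID, Month, Outstanding_Debt) and the toy values are assumptions; the ordered categorical makes "last" mean the latest month rather than the last row in file order.

```python
import pandas as pd

# Toy data: two hypothetical customers with three monthly rows each.
df = pd.DataFrame({
    "Customer_ID": ["C1", "C1", "C1", "C2", "C2", "C2"],
    "Month": ["January", "February", "March"] * 2,
    "Outstanding_Debt": [800.0, 790.0, 780.0, 1200.0, 1250.0, 1300.0],
})

# Give the month names a calendar order so sorting is chronological.
month_order = ["January", "February", "March", "April",
               "May", "June", "July", "August"]
df["Month"] = pd.Categorical(df["Month"], categories=month_order, ordered=True)

# Sort chronologically within each customer and keep the final row.
latest = (
    df.sort_values(["Customer_ID", "Month"])
      .drop_duplicates(subset="Customer_ID", keep="last")
      .reset_index(drop=True)
)

print(latest)
```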
MODEL EXPLORATION Since the objective is to classify a client's credit score, the problem was treated as a supervised machine learning task. The following techniques were used: Logistic Regression, Decision Tree, Random Forest, Neural Network, Support Vector Machine, and XGBoost. Before running the models, the variable "Credit Score" was defined as the outcome and the other variables as predictors. Dummy variables were generated for the categorical variables, and the data was split with a test size of 0.4. Each model was then run, and its confusion matrix and accuracy were displayed.
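The preparation and evaluation steps above can be sketched with scikit-learn as follows. The data is synthetic, the column names and class labels are assumptions, and a Decision Tree stands in for any of the listed models; only the dummy encoding, the 0.4 test size, the confusion matrix, and the accuracy come from the description.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; names and labels are illustrative assumptions.
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "Outstanding_Debt": rng.uniform(0, 5000, n),
    "Credit_Mix": rng.choice(["Bad", "Standard", "Good"], n),
})
df["Credit_Score"] = np.where(df["Outstanding_Debt"] > 2500, "Poor", "Good")

# Outcome and predictors, with dummies for the categorical predictor.
y = df["Credit_Score"]
X = pd.get_dummies(df.drop(columns="Credit_Score"))

# Hold out 40% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=1
)

model = DecisionTreeClassifier(random_state=1)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred, labels=["Good", "Poor"])
acc = accuracy_score(y_test, y_pred)
print(cm)
print(acc)
```

The same split and metrics can be reused for every model so that the accuracies in the comparison are measured on identical test rows.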
MODEL COMPARISON Attached as an image.
MODEL RECOMMENDATION According to the model comparison table, the Decision Tree with Grid Search has the highest accuracy of the models tested. It is therefore the recommended model.
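A Decision Tree tuned with grid search might be set up along these lines. This is a sketch on synthetic data: the parameter grid, the 5-fold cross-validation, and the random seeds are assumptions, since the report does not state the values that were searched.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class data standing in for the prepared credit dataset.
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=4, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=1
)

# Hypothetical grid; the values searched in the project are not given.
param_grid = {
    "max_depth": [3, 5, 10],
    "min_samples_split": [2, 10, 20],
}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

# The best estimator is refit on the full training set by default.
best_tree = search.best_estimator_
test_acc = best_tree.score(X_test, y_test)
print(search.best_params_)
print(test_acc)
```

Because the grid is selected by cross-validation on the training rows only, the held-out test accuracy remains a fair number to compare against the untuned models.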