Customer Segmentation and Insights from Messy Data: A Data Cleaning and Analysis Journey


About this project

In this report, I’ll detail my journey to transform a disorderly dataset into actionable insights through rigorous data cleaning and analysis. The dataset, titled “Jumbled-up Customer Details,” initially posed challenges with its mixed-up information, missing values, and inconsistent formats. However, through systematic data cleaning and thorough analysis, I unearthed valuable insights that can inform strategic decision-making.

The Messy Data

Initial messy data

The “Jumbled-up Customer Details” dataset packed each customer's name, address, age, and gender into a single jumbled text field, with the labels Name, Address, Age, and Gender embedded in the string. Until those fields were separated, no meaningful analysis was possible.

Data Cleaning Process

import pandas as pd

# Define a function to split the jumbled data
def split_customer_details(row):
    """Split one jumbled record into name, address, age, and gender."""
    # The raw text sits in the first column; .iloc selects it by position
    # regardless of how the columns are labeled
    data = row.iloc[0]
    # Everything between the 'Name' and 'Address' labels is the name
    parts = data.split('Address')
    name_part = parts[0].split('Name')[1].strip()
    # Everything between the 'Address' and 'Age' labels is the address
    address_parts = parts[1].split('Age')
    address = address_parts[0].strip()
    # The remainder holds the age, then the gender after the 'Gender' label
    age_gender_parts = address_parts[1].split('Gender')
    age = age_gender_parts[0].strip()
    gender = age_gender_parts[1].strip()

    return pd.Series([name_part, address, age, gender])
import re

# Define a function to extract state from address
def extract_state(address):
    """Return the Nigerian state named in an address, or 'Unknown'."""
    # List of Nigerian states
    nigerian_states = [
        'Abia', 'Adamawa', 'Akwa Ibom', 'Anambra', 'Bauchi', 'Bayelsa', 'Benue', 'Borno', 'Cross River',
        'Delta', 'Ebonyi', 'Edo', 'Ekiti', 'Enugu', 'FCT', 'Gombe', 'Imo', 'Jigawa', 'Kaduna', 'Kano',
        'Katsina', 'Kebbi', 'Kogi', 'Kwara', 'Lagos', 'Nasarawa', 'Niger', 'Ogun', 'Ondo', 'Osun', 'Oyo',
        'Plateau', 'Rivers', 'Sokoto', 'Taraba', 'Yobe', 'Zamfara'
    ]

    # Dictionary of cities and their corresponding states
    city_states = {
        'Onitsha': 'Anambra',
        'Abeokuta': 'Ogun'
    }

    # Abuja is the capital city inside the Federal Capital Territory
    if 'Abuja' in address:
        return 'FCT'

    for state in nigerian_states:
        # Match whole words only, so 'Niger' does not match 'Nigeria'
        if re.search(rf'\b{re.escape(state)}\b', address):
            return state

    for city, state in city_states.items():
        if city in address:
            return state

    # If no state is found, return 'Unknown'
    return 'Unknown'
# Apply the function to each row; Customer_details is the raw DataFrame
# loaded earlier, with the jumbled text in its first column
cleaned_data = Customer_details.apply(split_customer_details, axis=1)
# Rename the columns
cleaned_data.columns = ['Name', 'Address', 'Age', 'Gender']
# Extract state from each address
cleaned_data['State'] = cleaned_data['Address'].apply(extract_state)

I began with the data-cleaning code above. One custom function splits each jumbled record into separate name, address, age, and gender columns. A second function derives the state from the address, labeling an address 'Unknown' when it names no state or known city instead of dropping the row. This left the dataset organized and ready for analysis.
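
To make the splitting step concrete, here is a minimal sketch run on a single made-up record. The record's layout (each label followed directly by its value) is an assumption, since the raw file is not shown, and `split_record` and `find_state` are hypothetical helpers that mirror the logic of the functions above on plain strings with a shortened state list. The second half illustrates one pitfall of substring matching on state names: 'Niger' is a substring of 'Nigeria', which whole-word matching avoids.

```python
import re

def split_record(data: str) -> dict:
    """Split one jumbled string on its embedded labels (hypothetical helper)."""
    name_part, rest = data.split('Address', 1)
    address, rest = rest.split('Age', 1)
    age, gender = rest.split('Gender', 1)
    return {
        'Name': name_part.split('Name', 1)[1].strip(),
        'Address': address.strip(),
        'Age': age.strip(),
        'Gender': gender.strip(),
    }

# Shortened state list, for illustration only
STATES = ['Lagos', 'Niger', 'Anambra']

def find_state(address: str) -> str:
    """Return the first state that appears as a whole word, else 'Unknown'."""
    for state in STATES:
        if re.search(rf'\b{re.escape(state)}\b', address):
            return state
    return 'Unknown'

# Made-up record; the real file's exact layout is an assumption
raw = "Name Ada Obi Address 12 Market Road, Onitsha Age 27 Gender Female"
record = split_record(raw)
print(record)

# Plain substring matching finds 'Niger' inside 'Nigeria'
substring_hit = 'Niger' in "Somewhere, Nigeria"
whole_word_result = find_state("Somewhere, Nigeria")
print(substring_hit, whole_word_result)
```

Splitting with a maximum of one split per label keeps the helper predictable if a label word happens to recur later in the string.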

Exploratory Data Analysis (EDA)

With the cleaned dataset in hand, I delved into exploratory data analysis (EDA) to uncover insights. I created visualizations to illustrate key metrics such as age distribution, gender distribution, and state distribution. These visualizations provided valuable context for understanding the dataset and identifying trends.
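
The plotting code is not included in this write-up, so the following is only a sketch of the summary tables that such charts are usually built from. The `sample` frame is made-up data standing in for `cleaned_data`, and its values are not taken from the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for cleaned_data; values are invented
sample = pd.DataFrame({
    'Age': [15, 22, 15, 40, 31],
    'Gender': ['Female', 'Male', 'Female', 'Female', 'Male'],
    'State': ['Lagos', 'Kano', 'Lagos', 'FCT', 'Kano'],
})

# Age distribution: count, mean, quartiles, min and max
age_summary = sample['Age'].describe()

# Gender distribution as proportions of all rows
gender_share = sample['Gender'].value_counts(normalize=True)

# State distribution as raw counts, most frequent first
state_counts = sample['State'].value_counts()

print(age_summary)
print(gender_share)
print(state_counts)
```

Each of these Series can be passed to a bar or histogram plot; `state_counts.plot(kind='bar')`, for example, draws the state distribution when matplotlib is installed.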

Feature Engineering

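# Convert Age column to numeric; values that cannot be parsed become NaN
cleaned_data['Age'] = pd.to_numeric(cleaned_data['Age'], errors='coerce')
# Create age groups. right=True keeps each bin's upper edge inside the bin,
# so the intervals [0, 18], (18, 35], (35, 50], (50, 100] match the labels
bins = [0, 18, 35, 50, 100]
labels = ['0-18', '19-35', '36-50', '51+']
cleaned_data['Age_Group'] = pd.cut(cleaned_data['Age'], bins=bins, labels=labels, right=True, include_lowest=True)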
# Define a function to map states to regions
def map_state_to_region(state):
    """Return the geopolitical zone for a state, or 'Other' if unmapped."""
    # Regions mapping dictionary, grouped by geopolitical zone
    regions = {
        # North Central (extract_state returns 'FCT' for Abuja addresses)
        'Benue': 'North Central',
        'FCT': 'North Central',
        'Abuja': 'North Central',
        'Kogi': 'North Central',
        'Kwara': 'North Central',
        'Nasarawa': 'North Central',
        'Niger': 'North Central',
        'Plateau': 'North Central',
        # North East
        'Adamawa': 'North East',
        'Bauchi': 'North East',
        'Borno': 'North East',
        'Gombe': 'North East',
        'Taraba': 'North East',
        'Yobe': 'North East',
        # North West
        'Jigawa': 'North West',
        'Kaduna': 'North West',
        'Kano': 'North West',
        'Katsina': 'North West',
        'Kebbi': 'North West',
        'Sokoto': 'North West',
        'Zamfara': 'North West',
        # South East
        'Abia': 'South East',
        'Anambra': 'South East',
        'Ebonyi': 'South East',
        'Enugu': 'South East',
        'Imo': 'South East',
        # South South
        'Akwa Ibom': 'South South',
        'Bayelsa': 'South South',
        'Cross River': 'South South',
        'Delta': 'South South',
        'Edo': 'South South',
        'Rivers': 'South South',
        # South West
        'Ekiti': 'South West',
        'Lagos': 'South West',
        'Ogun': 'South West',
        'Ondo': 'South West',
        'Osun': 'South West',
        'Oyo': 'South West',
    }

    # Return the region corresponding to the state ('Unknown' maps to 'Other')
    return regions.get(state, 'Other')

# Map states to regions
cleaned_data['Region'] = cleaned_data['State'].apply(map_state_to_region)

To enrich my analysis, I employed feature engineering techniques. I created new features, such as “Age Group” and “Region,” to categorize and segment the data further. This allowed for deeper insights into demographic trends and regional preferences.
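
The bin edges are worth a quick check, because `pd.cut` assigns boundary ages to different groups depending on its `right` argument. The sketch below runs the same bins and labels on a few boundary ages; the ages are made up for illustration.

```python
import pandas as pd

# Boundary ages chosen to sit exactly on the bin edges
ages = pd.Series([17, 18, 35, 50])
bins = [0, 18, 35, 50, 100]
labels = ['0-18', '19-35', '36-50', '51+']

# right=False: bins are [0, 18), [18, 35), [35, 50), [50, 100)
left_closed = pd.cut(ages, bins=bins, labels=labels, right=False)

# right=True with include_lowest: bins are [0, 18], (18, 35], (35, 50], (50, 100]
right_closed = pd.cut(ages, bins=bins, labels=labels, right=True, include_lowest=True)

comparison = pd.DataFrame({
    'Age': ages,
    'right=False': left_closed,
    'right=True': right_closed,
})
print(comparison)
```

With `right=False`, an 18-year-old lands in the group labeled '19-35', so labels written as inclusive upper bounds line up with the data only when `right=True`.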

Clustering Analysis

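from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Prepare data for clustering; copy so cleaned_data itself is not modified
X = cleaned_data[['Age', 'Gender']].copy()

# Convert categorical variables to numerical
X['Gender'] = X['Gender'].map({'Male': 0, 'Female': 1})

# KMeans cannot handle missing values, so drop rows with an unparsed age
# or a gender other than 'Male'/'Female'
X = X.dropna()

# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Perform KMeans clustering
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
# Assign labels by index so any dropped rows keep a missing Cluster value
cleaned_data.loc[X.index, 'Cluster'] = kmeans.fit_predict(X_scaled)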

import matplotlib.pyplot as plt
import seaborn as sns

# Visualize clusters
sns.scatterplot(data=cleaned_data, x='Age', y='Gender', hue='Cluster', palette='viridis')
plt.title('Customer Segmentation')
plt.xlabel('Age')
plt.ylabel('Gender')
plt.show()

# Analyze cluster characteristics
cluster_characteristics = cleaned_data.groupby('Cluster').agg({
    'Age': ['mean', 'min', 'max'],
    'Gender': lambda x: x.value_counts().index[0],
    'State': lambda x: x.value_counts().index[0],
    'Region': lambda x: x.value_counts().index[0]
})

# Display cluster characteristics
print("Cluster Characteristics:")
print(cluster_characteristics)

# Analyze feature distribution within each cluster: share of each region
feature_distribution = cleaned_data.groupby('Cluster')['Region'].value_counts(normalize=True)
print("\nFeature Distribution within Clusters:")
print(feature_distribution)

A primary objective was to segment customers by age and gender. Using the KMeans algorithm with three clusters on standardized age and numerically encoded gender, I identified distinct customer segments. Profiling each cluster by its typical age, gender, state, and region then suggested strategies tailored to each segment.
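
As a small illustration of the clustering steps, the sketch below runs the same sequence (encode gender, drop incomplete rows, standardize, fit KMeans) on made-up data. The `df` frame and its values are hypothetical, and one row has a missing age to show why labels are assigned by index rather than by position.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical data; the last row has a missing age
df = pd.DataFrame({
    'Age': [12, 14, 13, 30, 32, 31, 55, 58, None],
    'Gender': ['Female', 'Male', 'Female', 'Male', 'Female',
               'Male', 'Female', 'Male', 'Female'],
})

# Encode gender numerically on a copy, then drop incomplete rows
features = df[['Age', 'Gender']].copy()
features['Gender'] = features['Gender'].map({'Male': 0, 'Female': 1})
features = features.dropna()

# Standardize so age and gender contribute on a comparable scale
scaled = StandardScaler().fit_transform(features)

# Fit KMeans and keep the labels aligned with the surviving rows
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = pd.Series(kmeans.fit_predict(scaled), index=features.index)

# Index alignment leaves the dropped row with a missing cluster
df['Cluster'] = labels
print(df)
```

Assigning through the index avoids a length mismatch when rows are dropped before fitting, which a direct column assignment of the raw label array would raise.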

Results and Interpretation

The clustering analysis produced three customer segments. Clustering used only age and gender; the region in each heading reflects where that cluster's customers are concentrated.

  1. Cluster 0 (North West):
  • Average age of 17, with predominantly female customers.
  • Concentrated in Kaduna and other North West states.
  • Potential strategies: Target promotions and marketing campaigns at female customers in North West states.

  2. Cluster 1 (South East):
  • The youngest segment, with an average age of 13.
  • Predominantly female customers, with a higher proportion from South East states.
  • Potential strategies: Develop products or services tailored to younger female customers in South East states.

  3. Cluster 2 (South West):
  • Average age of 15, with a slight male majority.
  • Evenly distributed across South West and North Central states.
  • Potential strategies: Explore marketing opportunities targeting both genders in South West and North Central states.

Through meticulous data cleaning and analysis, I extracted valuable insights from the “Jumbled-up Customer Details” dataset. The clustering analysis revealed distinct customer segments with unique characteristics and preferences. Armed with these insights, businesses can make informed decisions and tailor strategies to better meet the needs of their customers.

Cleaned_Customer_Details.xlsx

HTML Analysis

Recommendations

  • Continuously monitor and analyze customer data to identify evolving trends and preferences.
  • Implement targeted marketing strategies based on the insights derived from clustering analysis.
  • Explore additional segmentation techniques to refine customer targeting and engagement further.