In this report, I detail how I turned a disorderly dataset into actionable insights through data cleaning and analysis. The dataset, titled “Jumbled-up Customer Details,” was hard to work with at first: each record's fields were run together, some values were missing, and formats were inconsistent. After systematic cleaning and analysis, I uncovered insights that can inform strategic decision-making.
The Messy Data
[Figure: Initial messy data]
The “Jumbled-up Customer Details” dataset packed each customer's name, address, age, and gender into a single tangled field. Until those fields were separated, it was difficult to extract any meaningful information from it.
Data Cleaning Process
import pandas as pd

# Define a function to split the jumbled data
def split_customer_details(row):
    # Each row holds one string laid out as: Name ... Address ... Age ... Gender ...
    data = row.iloc[0]
    # Everything before 'Address' holds the name
    parts = data.split('Address')
    name_part = parts[0].split('Name')[1].strip()
    # The text between 'Address' and 'Age' is the address
    address_parts = parts[1].split('Age')
    address = address_parts[0].strip()
    # The remainder holds the age, then the gender
    age_gender_parts = address_parts[1].split('Gender')
    age = age_gender_parts[0].strip()
    gender = age_gender_parts[1].strip()
    return pd.Series([name_part, address, age, gender])
import re

# Define a function to extract state from address
def extract_state(address):
    # List of Nigerian states
    nigerian_states = [
        'Abia', 'Adamawa', 'Akwa Ibom', 'Anambra', 'Bauchi', 'Bayelsa', 'Benue', 'Borno', 'Cross River',
        'Delta', 'Ebonyi', 'Edo', 'Ekiti', 'Enugu', 'FCT', 'Gombe', 'Imo', 'Jigawa', 'Kaduna', 'Kano',
        'Katsina', 'Kebbi', 'Kogi', 'Kwara', 'Lagos', 'Nasarawa', 'Niger', 'Ogun', 'Ondo', 'Osun', 'Oyo',
        'Plateau', 'Rivers', 'Sokoto', 'Taraba', 'Yobe', 'Zamfara'
    ]
    # Dictionary of cities and their corresponding states
    city_states = {
        'Onitsha': 'Anambra',
        'Abeokuta': 'Ogun'
    }
    # Abuja is the capital city, located in the Federal Capital Territory
    if 'Abuja' in address:
        return 'FCT'
    for state in nigerian_states:
        # Match whole words only, so 'Niger' is not found inside 'Nigeria'
        if re.search(rf'\b{re.escape(state)}\b', address):
            return state
    for city, state in city_states.items():
        if city in address:
            return state
    # If no state is found, return 'Unknown'
    return 'Unknown'
# Apply the function to each row
cleaned_data = Customer_details.apply(split_customer_details, axis=1)
# Rename the columns
cleaned_data.columns = ['Name', 'Address', 'Age', 'Gender']
# Extract state from each address
cleaned_data['State'] = cleaned_data['Address'].apply(extract_state)
I began with a data-cleaning pass. Custom functions split each jumbled record into separate name, address, age, and gender columns and derive the state from the address; addresses with no recognizable state are labeled “Unknown.” The result is a tidy table that is ready for analysis.
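The same split can also be sketched with a single regular expression, which may be easier to audit than chained `split` calls. This is a minimal sketch, assuming each record is one string in which the literal markers `Name`, `Address`, `Age`, and `Gender` appear once each and in that order; the pattern, the `parse_record` helper, and the sample record are illustrative and not taken from the dataset.

```python
import re

# Hypothetical pattern: assumes the markers appear once each, in this order,
# optionally followed by a colon. Adjust it to the real record layout.
RECORD_PATTERN = re.compile(
    r'Name:?\s*(?P<name>.*?)\s*'
    r'Address:?\s*(?P<address>.*?)\s*'
    r'Age:?\s*(?P<age>\d+)\s*'
    r'Gender:?\s*(?P<gender>\w+)',
    re.DOTALL,
)

def parse_record(text):
    """Return a dict of fields, or None when the record does not match."""
    match = RECORD_PATTERN.search(text)
    if match is None:
        return None
    fields = match.groupdict()
    fields['age'] = int(fields['age'])
    return fields

# Made-up record, for illustration only
sample = 'Name: Ada Obi Address: 12 Broad Street, Lagos Age: 34 Gender: Female'
print(parse_record(sample))
```

Because the pattern requires digits after `Age`, a place name such as Agege inside an address is not mistaken for the age marker, which is a case where plain string splitting on `Age` would cut the address short.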
Exploratory Data Analysis (EDA)
With the cleaned dataset in hand, I delved into exploratory data analysis (EDA) to uncover insights. I created visualizations to illustrate key metrics such as age distribution, gender distribution, and state distribution. These visualizations provided valuable context for understanding the dataset and identifying trends.
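Each of those charts is built from a simple frequency table. As a minimal sketch of that tabulation step, the helper below counts how often each value occurs and its share of the total; the `distribution` helper and the sample values are hypothetical and stand in for a column such as Gender or State.

```python
from collections import Counter

def distribution(values):
    """Return (value, count, share) tuples, most frequent first."""
    counts = Counter(values)
    total = sum(counts.values())
    return [(value, count, count / total) for value, count in counts.most_common()]

# Made-up values standing in for the State column
states = ['Lagos', 'Kano', 'Lagos', 'FCT', 'Lagos', 'Kano']
for value, count, share in distribution(states):
    # A row of '#' gives a rough text bar chart
    print(f'{value:<6} {count:>2} {share:6.1%} {"#" * count}')
```

In pandas, `Series.value_counts(normalize=True)` produces the same shares directly, and the resulting table can be passed to a bar plot.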
Feature Engineering
# Convert Age column to numeric
cleaned_data['Age'] = pd.to_numeric(cleaned_data['Age'], errors='coerce')
# Create age groups; bins are right-inclusive so each label matches its range
# (e.g. age 18 falls in '0-18' and age 35 in '19-35')
bins = [0, 18, 35, 50, float('inf')]
labels = ['0-18', '19-35', '36-50', '51+']
cleaned_data['Age_Group'] = pd.cut(cleaned_data['Age'], bins=bins, labels=labels, include_lowest=True)
# Define a function to map states to regions
def map_state_to_region(state):
    # Regions mapping dictionary
    regions = {
        'Lagos': 'South West',
        'Abuja': 'North Central',
        'Kano': 'North West',
        'Anambra': 'South East',
        'Rivers': 'South South',
        'Ogun': 'South West',
        'Kaduna': 'North West',
        'Enugu': 'South East',
        'Delta': 'South South',
        'Ondo': 'South West',
        'Kogi': 'North Central',
        'Plateau': 'North Central',
        'Edo': 'South South',
        'Oyo': 'South West',
        'Adamawa': 'North East',
        'Nasarawa': 'North Central',
        'Ekiti': 'South West',
        'Benue': 'North Central',
        'Akwa Ibom': 'South South',
        'Kwara': 'North Central',
        'Sokoto': 'North West',
        'Bauchi': 'North East',
        'Kebbi': 'North West',
        'Cross River': 'South South',
        'Imo': 'South East',
        'Jigawa': 'North West',
        'Gombe': 'North East',
        'Osun': 'South West',
        'Niger': 'North Central',
        'Zamfara': 'North West',
        'Bayelsa': 'South South',
        'Ebonyi': 'South East',
        'Yobe': 'North East',
        'Taraba': 'North East',
        'Borno': 'North East',
        'Katsina': 'North West',
        'FCT': 'North Central',  # Abuja addresses are stored as 'FCT'
    }
    # Return the region corresponding to the state ('Other' if unmapped)
    return regions.get(state, 'Other')
# Map states to regions
cleaned_data['Region'] = cleaned_data['State'].apply(map_state_to_region)
To enrich the analysis, I engineered two new features, “Age Group” and “Region,” to categorize and segment the data further. Grouping customers this way makes demographic and regional patterns easier to see.
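Bin edges are an easy place for off-by-one mistakes, so it can help to check the boundary ages by hand. The sketch below reproduces the intended grouping in plain Python with `bisect`; the `age_group` helper is hypothetical and assumes the labels are meant literally, so 18 belongs to “0-18” and 19 to “19-35”.

```python
from bisect import bisect_left

# Upper edge (inclusive) of every group except the last, open-ended one
UPPER_EDGES = [18, 35, 50]
LABELS = ['0-18', '19-35', '36-50', '51+']

def age_group(age):
    """Map an age to its label; return None for missing or negative ages."""
    if age is None or age != age or age < 0:  # age != age is True only for NaN
        return None
    # bisect_left counts the edges strictly below age, which is the label index
    return LABELS[bisect_left(UPPER_EDGES, age)]

for age in (0, 18, 19, 35, 36, 50, 51, 90):
    print(age, age_group(age))
```

Comparing a few boundary ages against the `Age_Group` column is a quick way to confirm that `pd.cut` is configured as intended, since its `right` parameter decides which edge each bin includes.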
Clustering Analysis
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Prepare data for clustering (copy to avoid modifying a view of cleaned_data)
X = cleaned_data[['Age', 'Gender']].copy()
# Convert categorical variables to numerical
X['Gender'] = X['Gender'].map({'Male': 0, 'Female': 1})
# KMeans cannot handle missing values, so keep only rows with both features
X = X.dropna()
cleaned_data = cleaned_data.loc[X.index].copy()
# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Perform KMeans clustering
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
cleaned_data['Cluster'] = kmeans.fit_predict(X_scaled)
# Visualize clusters
sns.scatterplot(data=cleaned_data, x='Age', y='Gender', hue='Cluster', palette='viridis')
plt.title('Customer Segmentation')
plt.xlabel('Age')
plt.ylabel('Gender')
plt.show()
# Analyze cluster characteristics
cluster_characteristics = cleaned_data.groupby('Cluster').agg({
    'Age': ['mean', 'min', 'max'],
    # Most frequent value within each cluster
    'Gender': lambda x: x.value_counts().index[0],
    'State': lambda x: x.value_counts().index[0],
    'Region': lambda x: x.value_counts().index[0]
})
# Display cluster characteristics
print("Cluster Characteristics:")
print(cluster_characteristics)
# Analyze the share of each region within each cluster
feature_distribution = cleaned_data.groupby('Cluster')['Region'].value_counts(normalize=True)
print("\nFeature Distribution within Clusters:")
print(feature_distribution)
A primary objective was to segment customers by age and gender. Using the KMeans algorithm with three clusters on the standardized features, I grouped customers into distinct segments. Profiling each segment by its age range and its most common gender, state, and region makes it possible to tailor strategies to specific customer groups.
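To make the clustering step less of a black box, here is a minimal pure-Python sketch of the assign-and-update loop that KMeans is built on (Lloyd's algorithm). It is not scikit-learn's implementation: the `kmeans` function, the fixed starting centroids, and the toy points are all hypothetical, and real data would be standardized first.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Cluster points (tuples) around the given starting centroids."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                new_centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(old)  # keep an empty cluster where it was
        if new_centroids == centroids:  # converged: nothing moved
            break
        centroids = new_centroids
    return labels, centroids

# Toy (age, gender code) points; in practice both columns are scaled first
points = [(20, 0), (22, 0), (21, 1), (60, 1), (62, 1), (61, 0)]
labels, centroids = kmeans(points, centroids=[(20, 0), (60, 1)])
print(labels)
print(centroids)
```

Because age spans dozens of years while the gender code is only 0 or 1, unscaled distances are dominated by age, which is why the features are standardized before clustering.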
Results and Interpretation
The clustering analysis unveiled three distinct customer segments:
2. Cluster 1 (South-East):
3. Cluster 2 (South-West):
Through careful data cleaning and analysis, I extracted valuable insights from the “Jumbled-up Customer Details” dataset. The clustering analysis revealed distinct customer segments, each with its own demographic and regional profile. With these insights, businesses can make informed decisions and tailor strategies to better meet the needs of their customers.
Recommendations