Introduction:-
Welcome to the Bellabeat data analysis case study! In this case study, I perform many of the real-world tasks of a junior data analyst. I assume that I am working for Bellabeat, a high-tech manufacturer of health-focused products for women, and that I meet different characters and team members along the way. In order to answer the key business questions, I will follow the steps of the data analysis process: ask, prepare, process, analyze, share, and act.
Ask:-
Two questions guide the analysis: how are consumers already using their smart devices, and how can those insights be applied to Bellabeat's products and marketing strategy?
Business Task:-
Bellabeat is a small company that creates health-focused smart devices for women. Bellabeat’s executive and marketing teams want insight into how consumers use their smart devices in order to reveal opportunities for growth and to guide the company’s marketing strategy.
The task is to focus on Bellabeat's products and analyze smart device usage data in order to gain insight into how people are already using their smart devices, and to determine how those insights can be applied to Bellabeat's products.
Prepare:-
The dataset was downloaded from Kaggle: https://www.kaggle.com/arashnic/fitbit. It contains personal fitness tracker data from thirty Fitbit users. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. The data includes information about daily activity, steps, and heart rate that can be used to explore users' habits. I am using Python to analyze this data.
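Before loading anything, it helps to confirm which CSV files the dataset actually ships with. The lines below are a minimal sketch, assuming the default Kaggle input path that is also used when reading the data later in this notebook.
#list the CSV files available in the dataset directory (path assumed from the Kaggle input layout)
import os
data_dir = "/kaggle/input/fitbit/Fitabase Data 4.12.16-5.12.16"
for file_name in sorted(os.listdir(data_dir)):
    print(file_name)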
#importing libraries
import numpy as np # linear algebra
import pandas as pd # data processing,data structure and data analysis
import matplotlib.pyplot as plt # data visualization
import seaborn as sns # data visualization
import datetime as dt # date time
Reading the files
#read the daily activity CSV file
data = pd.read_csv('/kaggle/input/fitbit/Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv')
#preview first 5 rows
data.head(5)
Process:-
# check null values in data
null_values = data.isnull().sum()
#show null values
null_values[:]
Id 0
ActivityDate 0
TotalSteps 0
TotalDistance 0
TrackerDistance 0
LoggedActivitiesDistance 0
VeryActiveDistance 0
ModeratelyActiveDistance 0
LightActiveDistance 0
SedentaryActiveDistance 0
VeryActiveMinutes 0
FairlyActiveMinutes 0
LightlyActiveMinutes 0
SedentaryMinutes 0
Calories 0
dtype: int64
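Besides null values, duplicate rows are worth checking during the Process step. The lines below are an optional sketch of that check; they assume the data frame loaded above and use pandas' built-in duplicated() method.
#count rows that are exact duplicates of an earlier row
duplicate_count = data.duplicated().sum()
print("Duplicate rows:", duplicate_count)
#if any were found, they could be dropped with data.drop_duplicates()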
#get info of the dataframe using info()
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 940 entries, 0 to 939
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Id 940 non-null int64
1 ActivityDate 940 non-null object
2 TotalSteps 940 non-null int64
3 TotalDistance 940 non-null float64
4 TrackerDistance 940 non-null float64
5 LoggedActivitiesDistance 940 non-null float64
6 VeryActiveDistance 940 non-null float64
7 ModeratelyActiveDistance 940 non-null float64
8 LightActiveDistance 940 non-null float64
9 SedentaryActiveDistance 940 non-null float64
10 VeryActiveMinutes 940 non-null int64
11 FairlyActiveMinutes 940 non-null int64
12 LightlyActiveMinutes 940 non-null int64
13 SedentaryMinutes 940 non-null int64
14 Calories 940 non-null int64
dtypes: float64(7), int64(7), object(1)
memory usage: 110.3+ KB
#check the dimensions of the data
data.shape
(940, 15)
#count distinct values of Id column
print(len(pd.unique(data["Id"])))
33
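The 33 distinct Ids span only a month of data, so it is also useful to see how many daily records each user contributed. This is an optional sketch on the same data frame, using the original column name "Id" since the renaming step comes later.
#count the number of daily records logged by each user
records_per_user = data["Id"].value_counts()
#summarize how consistently the 33 users logged data
print(records_per_user.describe())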
#convert "ActivityDate" to datetime64 dtype
data["ActivityDate"]= pd.to_datetime(data["ActivityDate"],format="%m/%d/%Y")
#print first 5 rows to confirm the column's new dtype
data["ActivityDate"].head(5)
0 2016-04-12
1 2016-04-13
2 2016-04-14
3 2016-04-15
4 2016-04-16
Name: ActivityDate, dtype: datetime64[ns]
#create new column "day_of_week"
data['day_of_week']=data['ActivityDate'].dt.day_name()
#create new column "month"
data['month']=data['ActivityDate'].dt.month_name()
#create new column "total_mins": sum of all activity-minute columns
data["total_mins"]=data["VeryActiveMinutes"]+data["FairlyActiveMinutes"]+data["LightlyActiveMinutes"]+data["SedentaryMinutes"]
#create new column "total_hours": total minutes converted to hours
data["total_hours"]=round(data["total_mins"]/60,1)
#rename columns: insert separators between words and lowercase the column names
data.rename(columns={"Id":"id","ActivityDate":"date","TotalSteps":"total_steps"
                    ,"TotalDistance":"total_dist","TrackerDistance":"track_dist"
                    ,"LoggedActivitiesDistance":"logged_dist","VeryActiveDistance":"very_active_dist"
                    ,"ModeratelyActiveDistance":"moderate_active_dist","LightActiveDistance":"light_active_dist"
                    ,"SedentaryActiveDistance":"sedentary_active_dist","VeryActiveMinutes":"very_active_mins"
                    ,"FairlyActiveMinutes":"fairly_active_mins","LightlyActiveMinutes":"lightly_active_mins"
                    ,"SedentaryMinutes":"sedentary_mins","Calories":"calories"}, inplace=True)
#create new list of rearrange columns
new_cols=['id','date','month','day_of_week','total_steps','total_dist','total_mins'
,'total_hours','calories','track_dist','logged_dist','very_active_dist'
,'moderate_active_dist','light_active_dist','sedentary_active_dist','very_active_mins'
,'fairly_active_mins','lightly_active_mins','sedentary_mins']
#use reindex() to rearrange the columns into the new order
data=data.reindex(columns=new_cols)
Analyze:-
#print first 5 rows using head() to check all changes
data.head(5)
#get the new shape of dataframe
data.shape
(940, 19)
#get statistics of the data
data.describe()
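describe() pools every row together, so a per-user view can sharpen the Analyze step. The sketch below is an optional addition that groups by id and averages a few key columns, using the renamed columns of the prepared data frame.
#average daily behaviour per user
per_user = data.groupby("id")[["total_steps", "total_hours", "calories"]].mean().round(1)
#sort users by average daily steps and preview the most active ones
per_user.sort_values("total_steps", ascending=False).head(10)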
Share:-
#size, style, grid
sns.set_style("whitegrid")
plt.figure(figsize=(8,4))
#set the plot
plt.hist(data.day_of_week, bins=7, width = 0.5, color="orange")
#set labels,title
plt.xlabel("Day of the week", color= 'black', size=14)
plt.xticks(rotation=45, size=14)
plt.ylabel("Frequency", color='black', size=14)
plt.title("App uses per day of week",size=20)
#show the plot
plt.show()
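One caveat with plt.hist on a string column is that the bars follow the order in which the values first appear in the data. As an optional alternative, the sketch below orders the weekdays explicitly before plotting, reusing the day_of_week column created earlier.
#count records per weekday in calendar order
weekday_order = ["Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday"]
day_counts = data["day_of_week"].value_counts().reindex(weekday_order)
#plot the ordered counts as a bar chart
plt.figure(figsize=(8,4))
sns.barplot(x=day_counts.index, y=day_counts.values, color="orange")
plt.xlabel("Day of the week", size=14)
plt.xticks(rotation=45)
plt.ylabel("Number of records", size=14)
plt.title("Records per day of the week", size=20)
plt.show()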
#size, style, grid
sns.set_style("whitegrid")
plt.figure(figsize=(4,4))
#set the plot
plt.hist(data.month, bins=3, width = 0.5, color="green")
#set labels,title
plt.xlabel("Month", color= 'black', size=14)
plt.xticks(rotation=30, size=14)
plt.ylabel("Frequency", color='black', size=14)
plt.title("App uses per month",size=20)
#show the plot
plt.show()
#size, style, grid
sns.set_style("whitegrid")
plt.figure(figsize=(8,4))
#set the plot
sns.scatterplot(data=data, x="total_hours", y="calories", hue="calories", palette= "viridis")
#set labels,titles
plt.xlabel("Number of Hours",size=15)
plt.ylabel("Calories",size=15)
plt.title("Calories Burned Per Hour",size=20)
plt.legend()
#show the plot
plt.show()
#size, style, grid
sns.set_style("whitegrid")
plt.figure(figsize=(8,4))
#set the plot
sns.scatterplot(data=data, x="total_steps", y="calories", hue= "calories" ,palette= "viridis")
#set the labels and title
plt.xlabel("Number of Steps",size=15)
plt.ylabel("calories",size=15)
plt.title("Calories Burned in Steps",size=20)
plt.legend()
<matplotlib.legend.Legend at 0x7fb543f1ce90>
#size, style, grid
sns.set_style("whitegrid")
plt.figure(figsize=(8,4))
#set the plot
sns.scatterplot(data=data, x="total_dist", y="calories", hue= "calories" ,palette= "viridis")
#set the labels and title
plt.xlabel("Total Distance",size=15)
plt.ylabel("calories",size=15)
plt.title("Calories Burned with Distance",size=20)
plt.legend()
<matplotlib.legend.Legend at 0x7fb543ec4750>
#sum each activity level's minutes and convert the totals to hours
very_active_mins=data["very_active_mins"].sum()/60
fairly_active_mins=data["fairly_active_mins"].sum()/60
lightly_active_mins=data["lightly_active_mins"].sum()/60
sedentary_mins=data["sedentary_mins"].sum()/60
#pie chart to show each activity level's share of total tracked hours
slices=[very_active_mins,fairly_active_mins,lightly_active_mins,sedentary_mins]
labels=["Very Active","Fairly Active","Lightly Active","Sedentary"]
colours=["grey", "orange", "pink", "green"]
explode=[0.1,0.1,0.1,0.1]
#size,style and title
plt.style.use("default")
plt.title("% of activity in Hours",size=20)
#set the plot
plt.pie(slices,labels=labels,colors=colours,explode=explode,autopct="%1.1f%%")
#show the plot
plt.show()
#sum the distance covered at each activity level
very_active_dist=data["very_active_dist"].sum()
moderate_active_dist=data["moderate_active_dist"].sum()
light_active_dist=data["light_active_dist"].sum()
#pie chart to show each activity level's share of total distance
slices=[very_active_dist,moderate_active_dist,light_active_dist]
labels=["Very Active","Moderate Active","Light Active"]
colours=["grey", "orange", "green"]
explode=[0.1,0.1,0.1]
#size,style and title
plt.style.use("default")
plt.title("% of activity of Distance",size=20)
#set the plot
plt.pie(slices,labels=labels,colors=colours,explode=explode,autopct="%1.1f%%")
#show the plot
plt.show()
Output: Light Active 61.7%, Very Active 27.8%, Moderate Active 10.5% of total distance.
Act:-
Recommendation:-
To access the complete notebook: Fitbit Wellness Tracker