Bikeshare - Google Capstone

About this project

This bike-share program features more than 5,800 bicycles and 600 docking stations. The business objective of this project was to analyze riders' usage behavior in order to convert as many casual riders as possible into annual members, because annual members are more profitable. I analyzed rider behavior based on data from July 2021 to June 2022: I cleaned, processed, and organized the data with R, performed the calculations, and turned the results into visualizations.

Key Findings:

1: Casual riders took 2,557,887 rides in total, whereas members took 3,341,852 rides in total.

2: Casual ridership peaks on Saturday and Sunday. The reason could be the influx of visitors and tourists on weekends, who prefer to pay per ride rather than buy memberships.

3: Although there is a high number of casual riders throughout the year, July and August are the peak months for casual bike riders. This could be due to summer tourism in those months, when riders rent bikes temporarily for their trips instead of buying memberships.

4: Casual ridership is highest on Saturday and Sunday afternoons and early evenings.

5: For members, Kingsbury St & Kinzie St is the most popular station, both as a start and as an end station.

6: For casual riders, Streeter Dr & Grand Ave is the most popular start and end station, with Millennium Park scoring 2nd on the list.

Business Suggestions: To act on the insights from this project, marketing spend should be concentrated on casual riders during July and August, and on weekends. Marketing aimed at casual riders should also target at least their top three most popular stations. Discount coupons for annual memberships, and flyers pointing out the benefits of an annual membership, are options to encourage casual riders to convert. Email marketing is not an option here, because we were not given access to customers' personal information, credit card details, or email addresses due to legal concerns.

When I did the descriptive analysis, I also noticed that casual riders' trips last almost twice as long as members' trips, on average.

Data Analysis Process

The data consisted of nearly 6 million rows (one row per ride), which made R a better choice for the analysis. I analyzed the days, times, and months of rides by casual riders and members, their ride lengths and distances, and their total number of rides. I also found the most popular start and end stations for both types of riders.

First, I installed and loaded the tidyverse, janitor, and lubridate packages in RStudio.

#Install packages

install.packages('tidyverse')
install.packages('janitor')
install.packages('lubridate')


#Load packages

library(tidyverse)
library(janitor)
library(lubridate)

Next, I imported bike usage data from July 2021 to June 2022 into RStudio.

#Import csv files into RStudio, assigning each month to its own data frame

Jul2021 <- read_csv("Bike_data/202107-divvy-tripdata.csv")
Aug2021 <- read_csv("Bike_data/202108-divvy-tripdata.csv")
Sep2021 <- read_csv("Bike_data/202109-divvy-tripdata.csv")
Oct2021 <- read_csv("Bike_data/202110-divvy-tripdata.csv")
Nov2021 <- read_csv("Bike_data/202111-divvy-tripdata.csv")
Dec2021 <- read_csv("Bike_data/202112-divvy-tripdata.csv")
Jan2022 <- read_csv("Bike_data/202201-divvy-tripdata.csv")
Feb2022 <- read_csv("Bike_data/202202-divvy-tripdata.csv")
Mar2022 <- read_csv("Bike_data/202203-divvy-tripdata.csv")
Apr2022 <- read_csv("Bike_data/202204-divvy-tripdata.csv")
May2022 <- read_csv("Bike_data/202205-divvy-tripdata.csv")
Jun2022 <- read_csv("Bike_data/202206-divvy-tripdata.csv")

Here I created a new data frame by merging the 12 monthly datasets. I also ran str() on each monthly data frame to make sure the fields, especially the started_at and ended_at dates, were imported with consistent types before relying on them.


merged_data <- bind_rows(Jul2021, Aug2021, Sep2021, Oct2021, Nov2021, Dec2021, Jan2022, Feb2022, Mar2022, Apr2022, May2022, Jun2022)
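
As a side note, the twelve read_csv() calls and the bind_rows() step above could be collapsed into a single pipeline. This is only a sketch of an alternative, assuming every monthly file sits in Bike_data/ and follows the YYYYMM-divvy-tripdata.csv naming shown above:

#Alternative: find all monthly csv files and read-and-stack them in one pass
trip_files <- list.files("Bike_data", pattern = "divvy-tripdata\\.csv$", full.names = TRUE)
merged_data <- map_dfr(trip_files, read_csv)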

#Check field formatting
str(Jul2021)
str(Aug2021)
str(Sep2021)
str(Oct2021)
str(Nov2021)
str(Dec2021)
str(Jan2022)
str(Feb2022)
str(Mar2022)
str(Apr2022)
str(May2022)
str(Jun2022)
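
Since janitor is already loaded, its compare_df_cols() function offers a more compact version of the same check: it lines up the column classes of all twelve data frames in a single table, making any mismatch easy to spot.

#Optional: compare column classes across the twelve monthly data frames at once
compare_df_cols(Jul2021, Aug2021, Sep2021, Oct2021, Nov2021, Dec2021,
                Jan2022, Feb2022, Mar2022, Apr2022, May2022, Jun2022)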

Here I cleaned the new merged data frame: I standardized the column names (removing extra spaces, parentheses, etc.) and dropped empty columns and rows.

#Clean column names & remove extra spaces, parentheses etc.
merged_data <- clean_names(merged_data)

#Remove empty columns and rows in our dataframe
merged_data <- remove_empty(merged_data, which = c("rows", "cols"))

Next, I made new columns extracting day_of_week, starting hour of the bike rides, months of the rides, and trip duration.

#To see more rows of a tibble while inspecting it, use print(merged_data, n = ...)

#Find day of week by using wday()
merged_data$day_of_week <-wday(merged_data$started_at, label=T, abbr = T)

#Use format(as.POSIXct) to extract a certain TIME HOUR FORMAT
merged_data$starting_hour <-format(as.POSIXct(merged_data$started_at), '%H')

#Extract the date using format(as.Date)
merged_data$month <-format(as.Date(merged_data$started_at), '%m')

#Find trip duration using difftime()
merged_data$trip_duration <-difftime(merged_data$ended_at, merged_data$started_at, units = 'sec')
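
A quick spot check (my addition, not part of the original walkthrough) confirms the four new columns line up with the raw timestamps before any filtering:

#Spot-check the derived columns against the raw timestamps
head(select(merged_data, started_at, ended_at, day_of_week, starting_hour, month, trip_duration))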

Using the code below, I removed all rows with a zero or negative trip duration; a negative duration means the recorded end time is earlier than the start time, which indicates bad data.

#Remove rows where trip_duration  <= 0
cleaned_df <-merged_data[!(merged_data$trip_duration<=0),]
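
To be safe, a one-line check (my addition) verifies that no zero or negative durations survived the filter:

#Confirm the filter worked; this should print 0
sum(cleaned_df$trip_duration <= 0)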

I also wanted to use this cleaned, newly created data frame to build visualizations in Tableau, so I used write.csv() to save it as a .csv file for later.

#Let's export our cleaned_df using write.csv, to use it with Tableau
write.csv(cleaned_df, file='Capstone_df.csv')
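
One caveat worth knowing: write.csv() also writes R's row numbers as an extra first column by default. If that column is unwanted in Tableau, this variant suppresses it:

#Variant: export without the row-number column
write.csv(cleaned_df, file = 'Capstone_df.csv', row.names = FALSE)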

Next, I created different charts as shown below to analyze and get insights from our data.

#Graph number of rides by rider type on different days of week
#Use options(scipen=) to turn off scientific notation on the graph axes
options(scipen=999)
ggplot(data=cleaned_df)+
  aes(x = day_of_week, fill = member_casual)+
  geom_bar(position = 'dodge') +
  labs(x = 'Day of Week', y = 'Number of Rides', fill = 'Member Type', title = 'Number of Rides by Member Type' )
ggsave("number_of_rides_by_member_type.png")

The maximum number of casual riders ride on Saturday and Sunday. As noted in the key findings, the reason could be the influx of visitors and tourists on weekends, who prefer to pay per ride rather than buy memberships.

#Graph number of rides by rider type per month
ggplot(data = cleaned_df) +
  aes(x = month, fill = member_casual)+
  geom_bar ( position = 'dodge') +
  labs( x = 'Month', y = 'Number of Rides', fill = 'Member Type', title = 'Number of Rides per Month')
ggsave("number_of_rides_per_month.png")

This bar chart clearly shows that although there is a high number of casual riders throughout the year, July and August are the peak months for casual bike riders. This could be due to summer tourism in those months, when riders rent bikes temporarily for their trips instead of buying memberships.

#Graph hourly use of bikes throughout the week by rider type
#Use facet_wrap() to create panels
#Use element_text(size = ) to reduce text size to fit the charts
#Use dpi in ggsave() to save the chart at a higher resolution for clarity on bigger screens

ggplot(data = cleaned_df) +
  aes(x = starting_hour, fill = member_casual) +
  facet_wrap(~day_of_week) +
  geom_bar() +
  labs(x = 'Starting Hour', y = 'Number of Rides', fill = 'Member Type', title = 'Hourly use of Bikes throughout the week') +
  theme(axis.text = element_text(size = 5))
ggsave("Hourly_use_of_bikes_throughout_the_week.png", dpi = 1000)

If we look at the bar charts above, we again see that the maximum number of casual riders ride on Saturday and Sunday afternoons and early evenings.

#Use aggregate() to calculate mean trip duration by rider type and day of week
aggregate(trip_duration ~ member_casual + day_of_week, data = cleaned_df, FUN = mean)
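
The same table can also be produced with dplyr, matching the group_by() and summarize() style used later in this analysis; a sketch:

#Equivalent dplyr version of the aggregate() call above
cleaned_df %>%
  group_by(member_casual, day_of_week) %>%
  summarize(mean_trip_duration = mean(trip_duration), .groups = 'drop')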

#Find ride lengths using start time and end time and name the new column ride_length

cleaned_df$ride_length <- difftime(cleaned_df$ended_at, cleaned_df$started_at)

#Let's see how our columns are structured one more time
str(cleaned_df)

#Change ride_length to numeric so we can run our calculations later
cleaned_df$ride_length <-as.numeric(as.character(cleaned_df$ride_length))

#Check if ride_length is numeric now
is.numeric(cleaned_df$ride_length)

#Let's install and load the geosphere package to work with latitudes and longitudes
install.packages ("geosphere")
library(geosphere)

#Now we find the ride distance and name the new column ride_distance
cleaned_df$ride_distance <- distGeo(matrix(c(cleaned_df$start_lng, cleaned_df$start_lat), ncol=2),
                                    matrix(c(cleaned_df$end_lng, cleaned_df$end_lat), ncol=2))
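
Worth noting: distGeo() returns NA for any ride that is missing its start or end coordinates, which is why drop_na() appears in the average-distance pipeline further below. A quick count (my addition) shows how many rides are affected:

#Count rides with missing coordinates, which produce NA distances
sum(is.na(cleaned_df$ride_distance))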
#Let's view a quick summary of data now
summary(cleaned_df)

#Calculate average, median, max and min ride lengths and group them separately for members and casual riders
cleaned_df %>%
  group_by(member_casual) %>%
    summarize(average_ride_length = mean(ride_length), median_length = median(ride_length), 
              max_ride_length = max(ride_length), min_ride_length = min(ride_length))

#Let's find out who takes more rides by calculating the total number of rides for members and casuals
cleaned_df %>%
  group_by(member_casual) %>%
  summarize(ride_count = length(ride_id))

#Let's look at the mean distance covered by members and casual riders
cleaned_df %>%
  group_by(member_casual) %>% drop_na() %>%
  summarize (average_ride_distance = mean(ride_distance)) %>%
  ggplot() +
  geom_col(mapping = aes (x = member_casual, y= average_ride_distance, fill = member_casual), show.legend = FALSE)+
  labs(title = "Average distance covered by Members and Casual Riders")
  
#Find the most popular stations using the count() and filter() functions, by making separate data frames for annual members and casual riders
annual_mem_df <- filter(cleaned_df, member_casual == 'member')
count(annual_mem_df, start_station_name, sort = T)
count(annual_mem_df, end_station_name, sort = T)
casual_mem_df <-filter(cleaned_df, member_casual == 'casual')
count(casual_mem_df, start_station_name, sort = T)
count(casual_mem_df, end_station_name, sort = T)
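
To make the station ranking easier to read, the casual riders' top start stations could also be charted in the same ggplot style used above. A minimal sketch; the top-10 cutoff and the slice_max() step are my additions:

#Chart the ten most popular start stations for casual riders
casual_mem_df %>%
  count(start_station_name, sort = TRUE) %>%
  drop_na(start_station_name) %>%
  slice_max(n, n = 10) %>%
  ggplot() +
  geom_col(mapping = aes(x = n, y = reorder(start_station_name, n))) +
  labs(x = 'Number of Rides', y = 'Start Station', title = 'Top 10 Start Stations for Casual Riders')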

