Data analysis and visualization of the Ford GoBike Sharing System

RichmondsTetteh/FordGoBikeSystem

FordGoBikeSystem

Data analysis and visualization of the Ford GoBike Sharing System

Dataset

The dataset I used is the Ford GoBike System Data. It initially contained 183,412 entries with 16 features (duration_sec, start_time, end_time, start_station_id, start_station_name, start_station_latitude, start_station_longitude, end_station_id, end_station_name, end_station_latitude, end_station_longitude, bike_id, user_type, member_birth_year, member_gender and bike_share_for_all_trip). After assessing and cleaning the data, 174,952 entries with 16 features remain. Most variables are numeric, but start_time, end_time, start_station_name, end_station_name, user_type, member_gender and bike_share_for_all_trip are non-numeric.
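The cleaning steps themselves are not listed here, so the following is only a minimal sketch of one plausible approach: it assumes the reduction in row count comes from dropping rows with missing values, followed by dtype fixes. The sample rows and the `clean` helper are made up for illustration.

```python
import pandas as pd

# Tiny made-up sample with a subset of the dataset's columns (values are illustrative only).
raw = pd.DataFrame({
    "duration_sec": [520, 61854, 430, 1200],
    "start_time": ["2019-02-28 17:32:10", "2019-02-28 18:53:21",
                   "2019-02-28 12:13:13", "2019-02-28 17:54:26"],
    "user_type": ["Customer", "Subscriber", "Subscriber", "Customer"],
    "member_birth_year": [1984.0, None, 1972.0, 1989.0],
    "member_gender": ["Male", None, "Female", "Other"],
})


def clean(df):
    """Assumed cleaning: drop incomplete rows, then fix column dtypes."""
    out = df.dropna().copy()                              # assumption: incomplete rows are removed
    out["start_time"] = pd.to_datetime(out["start_time"])  # text -> datetime
    out["member_birth_year"] = out["member_birth_year"].astype(int)
    for col in ["user_type", "member_gender"]:
        out[col] = out[col].astype("category")            # low-cardinality text -> category
    return out


cleaned = clean(raw)
print(len(raw), "->", len(cleaned))
```

On the real data, the same function would be applied to the DataFrame loaded from the downloaded CSV file.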

Summary of Findings

I first plotted a histogram of duration, since it is a numeric variable. The initial plot showed that duration follows a highly skewed distribution, so I applied a log scale to duration and plotted it again. Under the log scale, the duration distribution is roughly unimodal, with a large peak somewhere between 400 and 600 seconds. The distribution also appears to cut off abruptly at its maximum rather than declining in a smooth tail.
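The log-scaling step can be sketched as follows. This is an illustration under stated assumptions, not the notebook's actual code: the durations are synthetic (a lognormal sample chosen to peak near 500 seconds), and the bin width of 0.1 in log10 units is an arbitrary choice.

```python
import numpy as np

# Synthetic skewed durations in seconds (illustrative, not the real data).
rng = np.random.default_rng(0)
duration_sec = rng.lognormal(mean=6.2, sigma=0.7, size=10_000)

# Bin edges evenly spaced in log10 space, so every bin spans the same ratio.
step = 0.1
log_min = np.floor(np.log10(duration_sec.min()))
log_max = np.ceil(np.log10(duration_sec.max()))
n_bins = int(round((log_max - log_min) / step))
bins = np.logspace(log_min, log_max, n_bins + 1)

# Count rides per bin and report where the distribution peaks.
counts, edges = np.histogram(duration_sec, bins=bins)
peak = counts.argmax()
print(f"peak bin: {edges[peak]:.0f} to {edges[peak + 1]:.0f} seconds")
```

For plotting, the same `bins` array can be passed to matplotlib's `plt.hist(duration_sec, bins=bins)` and combined with `plt.xscale("log")` so that the bars appear equally wide.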

I then explored the three categorical variables: user_type, member_gender and bike_share_for_all_trip. The bar chart of user_type shows that subscribers have the highest count of bike sharing system use, while customers have the lowest. The bar chart of member_gender shows that male riders have the highest count, followed by female riders with a relatively smaller count, and riders in the Other category have the lowest count. The bar chart of bike_share_for_all_trip shows that No is the most common answer and Yes the least common.
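The counts behind such bar charts can be reproduced with `value_counts`. The sketch below uses a handful of invented rows, so the numbers it prints say nothing about the real dataset.

```python
import pandas as pd

# Illustrative rows only; they do not reflect the real proportions.
trips = pd.DataFrame({
    "user_type": ["Subscriber", "Subscriber", "Customer", "Subscriber", "Subscriber"],
    "member_gender": ["Male", "Female", "Male", "Other", "Male"],
    "bike_share_for_all_trip": ["No", "No", "No", "Yes", "No"],
})

for col in ["user_type", "member_gender", "bike_share_for_all_trip"]:
    counts = trips[col].value_counts()        # sorted from most to least frequent
    shares = (counts / len(trips)).round(2)   # same ordering, as proportions
    print(pd.DataFrame({"count": counts, "share": shares}), end="\n\n")
```

Sorting by frequency before plotting keeps the tallest bar first, which makes the ranking of categories easy to read.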

For the bivariate exploration, I investigated the relationships between pairs of variables in my data. The initial plot of duration vs user_type appeared to show no relationship between the categorical variable and duration, the numeric variable of interest. I therefore applied a log transformation to duration again and made new plots.

In the violin plot of Log_duration vs User_type, the box plot elements show that the median log duration for subscribers is lower than that for customers. The shape of each distribution (extremely skinny at each end and wide in the middle) indicates that durations for both customers and subscribers are highly concentrated around the median. The subscriber distribution is elongated at one end, while the customer distribution is elongated at both ends.

In the violin plot of Log_duration vs Member_gender, the box plot elements show that the median log duration is lowest for males, followed by the Other gender, and highest for females. The shape of each distribution (extremely skinny at each end and wide in the middle) indicates that durations for all three genders are highly concentrated around the median. The Other gender has a more elongated distribution at one end, which indicates that some riders in this group recorded much longer ride durations.

In the violin plot of Log_duration vs Bike_share_for_all_trip, the box plot elements show that the median log duration for Yes is lower than that for No. The shape of each distribution (extremely skinny at each end and wide in the middle) indicates that durations for both Yes and No are highly concentrated around the median. Both distributions are elongated at one end.

In the box plots of Log_duration vs the categorical variables, the boxes are relatively short, which means the data is compact, although there are some outliers. Plotting the full data as a violin plot with the transformed duration reveals much more than the earlier box plots.
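The box elements inside a violin plot correspond to per-group quartiles, which can also be computed directly. The following is a rough sketch on invented durations, assuming a base-10 log transform (the base used for Log_duration is not stated here).

```python
import numpy as np
import pandas as pd

# Invented durations in seconds, four rides per user type.
trips = pd.DataFrame({
    "duration_sec": [300, 450, 600, 520, 900, 1500, 2400, 700],
    "user_type": ["Subscriber"] * 4 + ["Customer"] * 4,
})
trips["log_duration"] = np.log10(trips["duration_sec"])  # assumption: base-10 log

# Quartiles of log duration per group: the numbers the box elements draw.
summary = (
    trips.groupby("user_type")["log_duration"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
summary.columns = ["q1", "median", "q3"]
summary["median_sec"] = 10 ** summary["median"]  # back-transform for readability
print(summary.round(2))
```

With seaborn available, `sns.violinplot(data=trips, x="user_type", y="log_duration", inner="box")` would draw the matching violin with an embedded box.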

I then looked at the relationships among the three categorical variables using bivariate plots. From the bar plot of user_type vs member_gender, I observed that subscribers have the highest counts across all genders. Male is the dominant gender by count for both user types, followed by female and then Other. From the bar plot of user_type vs bike_share_for_all_trip, I observed that all customers are recorded as No, and most subscribers are also recorded as No, with only a few subscribers recorded as Yes. From the bar plot of bike_share_for_all_trip vs member_gender, I observed that the majority of riders in each of the three gender categories are recorded as No. The male gender has a count of over 100,000 for No and fewer than 25,000 for Yes.
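The counts behind a clustered bar plot of two categorical variables are a cross-tabulation. A small sketch on invented rows, using `pd.crosstab`:

```python
import pandas as pd

# Invented rows in which no customer is recorded as Yes.
trips = pd.DataFrame({
    "user_type": ["Customer", "Customer", "Subscriber", "Subscriber",
                  "Subscriber", "Subscriber"],
    "bike_share_for_all_trip": ["No", "No", "No", "Yes", "No", "No"],
})

# Raw counts for each (user_type, answer) pair.
counts = pd.crosstab(trips["user_type"], trips["bike_share_for_all_trip"])

# Row-normalised shares: each user type sums to 1.
row_share = pd.crosstab(
    trips["user_type"], trips["bike_share_for_all_trip"], normalize="index"
)

print(counts)
print(row_share.round(2))
```

The row-normalised table is useful when group sizes differ greatly, since raw counts alone can hide how rare an answer is within a small group.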

I then explored the three categorical variables and their relationship with duration using multivariate plots. From the bar plot of duration, user_type and member_gender, I observed that customers of every gender record higher durations than subscribers. The Other gender has the highest duration bars for both customers and subscribers. From the bar plot of duration, member_gender and bike_share_for_all_trip, I observed that the Other gender again has the highest bars for both Yes and No, while the male gender has the lowest bars for both. For all three gender categories, the bar for No is higher than the bar for Yes.
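A grouped bar plot of duration against two categorical variables typically shows one aggregate per group. The sketch below assumes that aggregate is the mean duration (the default estimator in seaborn's `barplot`); the text above does not say which statistic was plotted, and the rows are invented.

```python
import pandas as pd

# Invented rides; the values are chosen only to make the table easy to check.
trips = pd.DataFrame({
    "member_gender": ["Male", "Male", "Male", "Female", "Female",
                      "Other", "Other", "Other"],
    "user_type": ["Subscriber", "Subscriber", "Customer", "Subscriber", "Customer",
                  "Subscriber", "Customer", "Customer"],
    "duration_sec": [500, 600, 900, 650, 1100, 800, 1400, 1600],
})

# Assumed aggregate: mean duration per (gender, user type) pair.
mean_duration = trips.pivot_table(
    index="member_gender",
    columns="user_type",
    values="duration_sec",
    aggfunc="mean",
)
print(mean_duration)
```

Each cell of the resulting table corresponds to one bar, so differences between user types can be read across a row and differences between genders down a column.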

Key Insights for Presentation

Distribution of the log duration of bike ride

Distribution of the user type vs member gender

Bar plot of durations, user_type and member_gender

Bar plot of durations, member_gender and bike_share_for_all_trip
