Divvy Bikes Case Study

Table of Contents
  1. About The Project
  2. Folder Structure
  3. Cloning the project
  4. Usage
  5. Challenges
  6. Contact
  7. Acknowledgments

About The Project

Circular Diagram Screenshot

If you want to take a direct look at this data analysis, you can do so right here.

It is hosted on RPubs, a service provided by RStudio, where users can publish their R Markdown documents online.

This is a case study of the Divvy Bikes rental service. I conducted this analysis to apply the skills I acquired through the Google Data Analytics Certificate program hosted on Coursera. The main objective of this study was to identify differences between members and casual users and to propose strategies for boosting subscription rates for Divvy services.

The analysis was done in R, a programming language for statistical computing. It relies mainly on the Tidyverse, a collection of libraries covering the essential tools for data science. The other libraries used for this project are skimr for summarizing data, RANN for finding the closest station to a user, patchwork for arranging plots, RColorBrewer for additional color palettes, and tibble for data frames.
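
If these packages aren't installed yet, a one-time setup along these lines should cover the whole analysis (a sketch; the package names are simply those listed above):

# One-time setup: install the packages used throughout the analysis
install.packages(c("tidyverse", "skimr", "RANN",
                   "patchwork", "RColorBrewer", "tibble"))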

Aside from R, I used two other tools for this study: Google My Maps, to create a map of the users' favorite stations, and Tableau, to create the heatmaps at the end of the analysis.

(back to top)

Folder Structure

In the root of the project, the main file is CompteRendu.Rmd, as it is the final result of the analysis and the file hosted on RPubs. There's also CompteRendu.html, which is the same document rendered as an HTML file. Beyond that, you have this README.md file, a .gitignore file, a .Rhistory file, and DivvyBikesCaseStudy.Rproj for configuration.

The rest is organised into folders:

  • Assets/ has the images used in this README.md file
  • Data/ holds the data used for this study, split into two folders: Chicago Shapefile/ contains the shapefile used in Tableau to create the heatmaps, and Study Data/ is where the .csv files used for the study are supposed to go
  • Document/ has the presentation of the case study
  • Output/ holds the data created during this analysis, also in two folders: Data/ contains the .rds file with the cleaned data and the .csv files used to create the heatmaps in Tableau, and Maps/ contains the maps created for this study
  • The rsconnect/ folder was created when uploading this analysis to RPubs
  • And the Script/ folder has the main R script used to perform the analysis

That's it for the folder structure!

(back to top)

Cloning the project

To clone this project, you just need Git.

  1. First, go to the directory where you want to install the project:

    cd path/to/your/directory
  2. Then run this command to clone the project:

    git clone https://github.com/GWartelle/DivvyBikesCaseStudy.git

Usage

Once the project is cloned, the first thing to do to replicate this analysis is to download the .csv files used for this study. You can do so right here.

This study was done using a full year of data, so I invite you to download the twelve latest .csv files, as they are organised by month.

Divvy Bikes Data Screenshot

Once you have downloaded the necessary files, you must put them in the Data/Study Data/ folder, otherwise the script won't work correctly.
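
To confirm the files landed in the right place, a quick sanity check from the project root could look like this (a sketch, assuming the folder layout described above):

# The script expects the monthly trip files here; this should list twelve .csv files
list.files("Data/Study Data", pattern = "\\.csv$")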

And when this is done, the next thing to do is to open the DivvyBikesDataAnalysis.R file (in the Script/ folder) in your editor.

Normally this should work in any modern IDE, but I would recommend either RStudio (made by Posit, the company formerly known as RStudio) or Jupyter, as these will let you easily run the script one chunk of code at a time.

In the script, there is also something you'll have to set up:

# Setting the working directory
setwd("")
getwd()

In this part of the code, between the quotation marks of setwd(""), you'll have to specify the exact directory in which you cloned the project, otherwise the code won't work.
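
For example, assuming the project was cloned into ~/projects/DivvyBikesCaseStudy (a hypothetical path), it would look like this:

# Hypothetical path: replace it with the directory you actually cloned into
setwd("~/projects/DivvyBikesCaseStudy")
getwd()  # should now print the project directory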

Once this is done, you can start running the script. I recommend doing so one chunk of code at a time to better understand the process of the analysis. Also, as this analysis deals with a massive amount of data, running the whole script in one go will take quite a while.

Divvy Bikes Rstudio Screenshot

By following the script, you'll eventually go through the entire analysis performed on this study's data.

If you wish to present this analysis, you can incorporate it into a notebook such as an R Markdown document or a Jupyter notebook.

Divvy Bikes Diagram Screenshot

And if you're feeling extra courageous, you could even use the three .csv files created at the end of the script, and stored in the Output/Data/ folder, to create heatmaps with Tableau (or any other BI tool of your choice).

Otherwise, feel free to use the heatmaps I created in the Output/Maps/ folder.

Divvy Bikes Heatmaps Screenshot

Lastly, if you end up with different graphs and results than mine, that is perfectly normal: depending on which .csv files you choose to download from divvy-tripdata, you will probably not be using the same data as I did.

(back to top)

Challenges

Of course, performing this analysis presented its share of challenges. I won't go over all of them, as they are for the most part covered in the CompteRendu.Rmd file, but I would like to show you the most interesting one I had to deal with.

The issue was that a quarter of the data I used was missing a station name: sometimes the start station of a trip, sometimes the end station, and for some trips both. Rather than lose that much useful data, I decided to fix it.

After a bit of investigation, I found that this was caused by trips that started and/or ended outside of a station, since users can secure their bike anywhere with the integrated bike lock. By doing so, they trigger a trip start and/or end event without registering any station name.

To fix this, I decided to replace these missing values with the name of the closest station. To do so, I first collected all the station names and their coordinates from my dataset:

# tibble() keeps the columns typed as a data frame (cbind() would coerce
# everything into a character matrix, which count() can't handle)
true_stations <- tibble(station_id, station_name, station_lat, station_lng, station_location) %>%
  count(station_id, station_name, station_lat, station_lng, station_location) %>%
  na.omit() %>%
  group_by(station_id) %>%
  slice_max(order_by = n, n = 1, with_ties = FALSE) %>%
  group_by(station_name) %>%
  slice_max(order_by = n, n = 1, with_ties = FALSE) %>%
  select(-n)

Here I gathered the id, name, latitude, longitude, and combined coordinates into a new true_stations dataset, counted each unique combination, and kept only the most frequent one per station id and then per station name (the recorded coordinates for a given station varied slightly from trip to trip).
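
To make that deduplication logic concrete, here is how it behaves on toy data (hypothetical station values, not taken from the real dataset):

library(dplyr)
library(tibble)

# Three sightings of the same station, one with slight GPS drift
obs <- tribble(
  ~station_id, ~station_name, ~station_lat, ~station_lng,
  "S1", "State St", 41.8800, -87.6278,
  "S1", "State St", 41.8800, -87.6278,
  "S1", "State St", 41.8801, -87.6279
)

obs %>%
  count(station_id, station_name, station_lat, station_lng) %>%
  group_by(station_id) %>%
  slice_max(order_by = n, n = 1, with_ties = FALSE)
# keeps the (41.8800, -87.6278) row, which was observed twice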

Next, I created subsets containing only the rows with missing station names:

start_missing <- clean_data %>%
  select(ride_id, start_station_name, start_lng, start_lat) %>%
  rename(start_missing_name = start_station_name) %>%
  filter(is.na(start_missing_name))

end_missing <- clean_data %>%
  select(ride_id, end_station_name, end_lng, end_lat) %>%
  rename(end_missing_name = end_station_name) %>%
  filter(is.na(end_missing_name))

Once this was done, I could tackle the most interesting part: finding the closest stations.

start_coords <- as.matrix(start_missing[, c("start_lng", "start_lat")])
end_coords <- as.matrix(end_missing[, c("end_lng", "end_lat")])
true_coords <- as.matrix(true_stations[, c("station_lng", "station_lat")])

start_neighbors <- RANN::nn2(true_coords, start_coords, k = 1)$nn.idx
start_nearest_stations <- true_stations$station_name[start_neighbors]
start_missing$start_missing_name <- start_nearest_stations

end_neighbors <- RANN::nn2(true_coords, end_coords, k = 1)$nn.idx
end_nearest_stations <- true_stations$station_name[end_neighbors]
end_missing$end_missing_name <- end_nearest_stations

Here, after extracting the coordinates of the rows with missing stations and those of all the known stations, the only thing left to do was to compare them.

In start_neighbors <- RANN::nn2(true_coords, start_coords, k = 1)$nn.idx, the RANN::nn2 function compares the two matrices, finds for each point in start_coords the index of the nearest point (i.e. the nearest coordinates) in true_coords, and stores all these indices in start_neighbors.
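
A minimal, self-contained illustration of how nn2 behaves (toy coordinates, not real stations; note that nn2 uses plain Euclidean distance, which is only an approximation on longitude/latitude values):

library(RANN)

# Three known stations and two query points (toy longitude/latitude values)
known <- matrix(c(-87.62, 41.88,
                  -87.65, 41.90,
                  -87.60, 41.79), ncol = 2, byrow = TRUE)
queries <- matrix(c(-87.63, 41.89,
                    -87.61, 41.80), ncol = 2, byrow = TRUE)

nn <- RANN::nn2(known, queries, k = 1)
nn$nn.idx   # for each query row, the row index of its nearest known station
nn$nn.dists # the Euclidean distance to that nearest station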

Using the indices stored in start_neighbors, the next line assigns the corresponding station names from true_stations to a new variable, start_nearest_stations. The following line then fills in the missing start station names with those stored in start_nearest_stations.

Then I just had to repeat this process with the missing end stations.

Finally, the last thing to do was to update the main dataset with these newly recovered station names:

filling_stations <- full_join(start_missing, end_missing, by = "ride_id") %>%
  select(ride_id, start_missing_name, end_missing_name)

clean_data <- clean_data %>%
  left_join(filling_stations, by = "ride_id")

clean_data$start_station_name <- ifelse(is.na(clean_data$start_station_name),
                                        clean_data$start_missing_name,
                                        clean_data$start_station_name)

clean_data$end_station_name <- ifelse(is.na(clean_data$end_station_name),
                                      clean_data$end_missing_name,
                                      clean_data$end_station_name)

clean_data <- clean_data %>%
  select(-start_missing_name, -end_missing_name)

Here, after combining the updated start_missing and end_missing datasets into a new filling_stations dataset, I joined it to the main dataset based on the ride_id.

Then I simply replaced each start station name with the one from filling_stations when the original was NA, and kept the original otherwise.

After repeating this for the end station names, the only thing left was to drop the columns joined in from filling_stations, to tidy up the original dataset.
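
As a side note, the same replace-if-NA logic can be written a bit more compactly with dplyr's coalesce(), which returns the first non-missing value among its arguments; a sketch of that alternative:

library(dplyr)

# Equivalent sketch: coalesce() keeps the original name when present,
# and falls back to the recovered one when it is NA
clean_data <- clean_data %>%
  left_join(filling_stations, by = "ride_id") %>%
  mutate(
    start_station_name = coalesce(start_station_name, start_missing_name),
    end_station_name = coalesce(end_station_name, end_missing_name)
  ) %>%
  select(-start_missing_name, -end_missing_name)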

And with this, I basically saved a quarter of my dataset!

(back to top)

Contact

If you want to see more of my work, I invite you to check out my portfolio.

You can also take a look at my other projects on my GitHub.

And if you'd like to get in touch with me, feel free to reach out on LinkedIn.

(back to top)

Acknowledgments

As mentioned above, this data analysis was done as part of Google's Data Analytics Certificate program.

So I would like to thank Google for creating such an impressive course, and also Coursera for providing so much amazing content to the world.

And of course, I would like to thank you for taking the time to read through all this! I wish you the best 😁

Have a great day 😉

(back to top)
