Analyzing New York City Taxi Data: a MapReduce approach

Objective: Understand taxi transportation dynamics in New York City (NYC) and how they have been affected by Uber, in order to support better-informed policy-making on mobility in NYC.

We use MapReduce to analyze taxi and Uber data from NYC.

Dependencies:

Python 2.7
numpy
pickle
csv
MrJob
haversine
json
requests
datetime

Task A: Analyzing taxi demand in big concerts

0. Getting Data

Concerts

We obtained a database of 325 concerts in NYC, ranging from 2009 to 2015, using the Bandsintown API.
We manually verified the coordinates of several venues, because some of them had default values that did not match any known venue.
We used “get_bands.py” to retrieve the information in CSV format and then converted it to a JSON file.
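The CSV-to-JSON step could look roughly like the sketch below. It is not the code in get_bands.py; the column names (venue, lat, lon, start) and the sample row are hypothetical placeholders, and the sketch is written for Python 3.

```python
import csv
import io
import json


def csv_to_json(csv_text):
    """Convert CSV text with a header row into a JSON array of objects."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return json.dumps([dict(row) for row in reader], indent=2)


# Hypothetical sample row; the real columns written by get_bands.py may differ.
sample = (
    "venue,lat,lon,start\n"
    "Example Hall,40.7505,-73.9934,2013-03-01T20:00:00\n"
)
concerts_json = csv_to_json(sample)
print(concerts_json)
```

Note that csv.DictReader returns every value as a string, so coordinates would still need to be converted to floats before any distance calculation.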

Taxi rides

We downloaded two types of data: Uber rides and yellow cab rides.
We used scripts to download monthly files of yellow cab rides from 1/2009 to 12/2015. We also obtained monthly Uber rides for the period from 4/2014 to 9/2014.
The data was uploaded to an S3 bucket.

1. Counting taxi rides by event

We counted how many taxi rides occurred within a three-hour window starting at the beginning of each event (as reported by the API) and no more than 200 meters from the venue coordinates.
Processing a one-month file with 20 instances on AWS takes about 24 minutes (e.g. using python3 ~/…/map_taxi_events.py -r emr s3://…/yellow_tripdata_2013-03.csv).
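The matching rule above (within three hours of the event start and within 200 meters of the venue) can be sketched as a map step and a reduce step in plain Python. This is an illustration, not the contents of map_taxi_events.py: the function names, the event tuple layout, and the sample rides are assumptions, the distance is computed with a hand-written haversine formula rather than the haversine package, and the source does not say whether the pickup or the drop-off coordinate is compared against the venue.

```python
from datetime import datetime, timedelta
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6371000.0   # mean Earth radius in meters
MAX_DISTANCE_M = 200.0       # distance threshold from the text
WINDOW = timedelta(hours=3)  # time window from the text


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


def map_ride(ride_time, lat, lon, events):
    """Map step: yield (event_id, 1) for every event this ride matches."""
    for event_id, start, ev_lat, ev_lon in events:
        in_window = start <= ride_time <= start + WINDOW
        if in_window and haversine_m(lat, lon, ev_lat, ev_lon) <= MAX_DISTANCE_M:
            yield event_id, 1


def reduce_counts(pairs):
    """Reduce step: sum the emitted ones per event."""
    counts = {}
    for event_id, n in pairs:
        counts[event_id] = counts.get(event_id, 0) + n
    return counts


# Hypothetical event and rides, for illustration only.
events = [("concert-1", datetime(2013, 3, 1, 20, 0), 40.7505, -73.9934)]
rides = [
    (datetime(2013, 3, 1, 20, 30), 40.7510, -73.9930),  # close, in window
    (datetime(2013, 3, 1, 22, 59), 40.7505, -73.9934),  # close, in window
    (datetime(2013, 3, 1, 20, 30), 40.7600, -73.9934),  # about 1 km away
    (datetime(2013, 3, 1, 23, 30), 40.7505, -73.9934),  # after the window
    (datetime(2013, 3, 1, 19, 59), 40.7505, -73.9934),  # before the event
]
pairs = [kv for ride in rides for kv in map_ride(ride[0], ride[1], ride[2], events)]
counts = reduce_counts(pairs)
print(counts)
```

In an MRJob job, the same logic would sit in the mapper and reducer methods, with the event list loaded once per task rather than passed as an argument.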

2. Comparing taxi demand before and after Uber started operations in NYC

Results were analyzed separately in a spreadsheet. To compare the periods before and after Uber began operating, we divided total ride counts by the total capacity of the venues (in the case of Madison Square Garden).
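The normalization described above amounts to a rides-per-seat ratio. The sketch below shows the arithmetic only; the analysis itself was done in a spreadsheet, and every number here is a made-up placeholder, not a result from the project.

```python
def rides_per_seat(events):
    """events: list of (ride_count, venue_capacity) pairs, one per event."""
    total_rides = sum(rides for rides, _ in events)
    total_capacity = sum(capacity for _, capacity in events)
    if total_capacity <= 0:
        raise ValueError("total capacity must be positive")
    return total_rides / float(total_capacity)


# Placeholder values for illustration only (not project data).
before_uber = [(300, 5000), (200, 5000)]
after_uber = [(150, 5000), (225, 5000)]

ratio_before = rides_per_seat(before_uber)
ratio_after = rides_per_seat(after_uber)
relative_change = (ratio_after - ratio_before) / ratio_before
print(ratio_before, ratio_after, relative_change)
```

Dividing by capacity makes periods with different numbers or sizes of events comparable, which a raw ride count would not be.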

Task B: Destination likelihood

0. Getting Data

Manhattan

We obtained the coordinate points for the polygon for Manhattan from here
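A polygon like this can be used to decide whether a pickup or drop-off point lies inside Manhattan. One common way to do that is a ray-casting test, sketched below; the source does not say which method the project uses, and the rectangle here is a crude placeholder, not the actual Manhattan boundary.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test; polygon is a list of (lon, lat) vertices in order."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed
        # by a horizontal ray cast from the point toward +longitude.
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


# Placeholder bounding box, NOT the real Manhattan polygon.
rough_box = [(-74.02, 40.70), (-73.93, 40.70), (-73.93, 40.88), (-74.02, 40.88)]

print(point_in_polygon(-73.9855, 40.7580, rough_box))  # a point inside the box
print(point_in_polygon(-73.8000, 40.7500, rough_box))  # a point east of the box
```

An odd number of edge crossings means the point is inside. Points that fall exactly on an edge are not handled specially in this sketch.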

Taxi rides

We used the same taxi ride data as in Task A.

1. Clustering with K-Means

For each year (2009-2015), we identify a set of cluster centroids (starting with K=10) for taxi pickup and drop-off locations in three time categories: weekday daytime, weekday nighttime, and weekends. We only look at trips that start and end within Manhattan.
K-Means code via uchicago-cs/cmsc12300.
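For readers unfamiliar with how K-Means fits MapReduce, one iteration can be expressed as a map step (assign each point to its nearest centroid) followed by a reduce step (average the points of each cluster). The sketch below is a generic, single-machine illustration in plain Python and is not the uchicago-cs/cmsc12300 code; it uses squared Euclidean distance on raw coordinates, which is only an approximation for latitude and longitude, and the toy points and K=2 are placeholders.

```python
def nearest_centroid(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    best, best_d = 0, float("inf")
    for i, c in enumerate(centroids):
        d = (point[0] - c[0]) ** 2 + (point[1] - c[1]) ** 2
        if d < best_d:
            best, best_d = i, d
    return best


def kmeans_step(points, centroids):
    """One iteration: 'map' points to clusters, 'reduce' to new centroids."""
    sums = {}
    for p in points:  # map: key each point by its nearest centroid
        k = nearest_centroid(p, centroids)
        sx, sy, n = sums.get(k, (0.0, 0.0, 0))
        sums[k] = (sx + p[0], sy + p[1], n + 1)
    updated = []
    for k, c in enumerate(centroids):  # reduce: average per cluster
        if k in sums:
            sx, sy, n = sums[k]
            updated.append((sx / n, sy / n))
        else:
            updated.append(c)  # an empty cluster keeps its old centroid
    return updated


def kmeans(points, centroids, max_iter=50, tol=1e-9):
    """Repeat kmeans_step until the centroids stop moving."""
    for _ in range(max_iter):
        updated = kmeans_step(points, centroids)
        shift = max(abs(a[0] - b[0]) + abs(a[1] - b[1])
                    for a, b in zip(updated, centroids))
        centroids = updated
        if shift < tol:
            break
    return centroids


# Toy data with two obvious groups, for illustration only.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
final = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
print(final)
```

In a distributed setting, each iteration would be one MapReduce job, with the current centroids shipped to every mapper and the job rerun until the centroids converge.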

2. Trip probability

For each trip starting and ending in Manhattan, we determine which pickup cluster and which drop-off cluster it belongs to. We reduce on pickup locations and break the results down into 30-minute increments. We then calculate the probability (as a relative frequency) of going to any given drop-off cluster at that time from that starting region.
We look at how the probability of different destinations changes throughout the day from different starting points (e.g. “If I’m in Times Square at midnight, where am I likely to go?” versus “If I’m in Times Square at 7pm…?”), and at how this differs between weekends and weekdays.
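The relative-frequency calculation can be sketched as follows: trips are keyed by (pickup cluster, 30-minute slot), drop-off clusters are counted per key, and each count is divided by the key's total. The function names, the tuple layout, and the cluster ids are hypothetical, and the sample trips are made up; the weekday/weekend split is left out for brevity.

```python
from collections import Counter, defaultdict
from datetime import datetime


def half_hour_bin(ts):
    """Index (0 to 47) of the 30-minute slot of the day that contains ts."""
    return ts.hour * 2 + ts.minute // 30


def destination_probabilities(trips):
    """trips: iterable of (pickup_time, pickup_cluster, dropoff_cluster)."""
    counts = defaultdict(Counter)
    for ts, pickup, dropoff in trips:
        counts[(pickup, half_hour_bin(ts))][dropoff] += 1
    probs = {}
    for key, counter in counts.items():
        total = sum(counter.values())
        probs[key] = {d: n / float(total) for d, n in counter.items()}
    return probs


# Made-up trips from pickup cluster 3, for illustration only.
trips = [
    (datetime(2014, 5, 2, 0, 10), 3, 1),
    (datetime(2014, 5, 2, 0, 20), 3, 1),
    (datetime(2014, 5, 2, 0, 25), 3, 2),
    (datetime(2014, 5, 2, 0, 5), 3, 5),
    (datetime(2014, 5, 2, 19, 5), 3, 2),
]
probs = destination_probabilities(trips)
print(probs[(3, 0)])   # destinations for the 00:00-00:30 slot
print(probs[(3, 38)])  # destinations for the 19:00-19:30 slot
```

Comparing the dictionaries for two slots of the same pickup cluster is the midnight-versus-7pm comparison described above.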

Repository: ladyson/123bigdata