ajkrukowski edited this page Apr 17, 2015 · 11 revisions

Welcome to the BigDataTaxiFinalProject wiki!

Saturday April 4, 2015

We successfully copied all taxi data from 2010 to 2013 onto CUSP's cluster. We had issues using wget to copy files from https://uofi.app.box.com/NYCtaxidata directly to the CUSP shell. Our theory is that it's b/c of security measures on box.com's site. We had to copy all zip files onto local machines in PP, then copy them to the shell, and finally to the cluster. We used decompress.py from https://uofi.app.box.com/NYCtaxidata to decompress all folders and files.

Plan for Sunday: use map.py, reduce.py files from HW3, Task 1 to join all trip and fare data.

Monday April 6, 2015

Data Preprocessing

Gave access to the akabd directory to Amanda and Andrea on the NYU CUSP Cluster. Ran the map.py and reduce.py scripts on all the FOIL directories to join the trips and fares data using the key attributes: medallion, hack_license, vendor_id, and pickup_datetime. For our first TripFareJoin, we are using 4 reducers. It's been running for 45 mins; waiting to see if it worked properly.

The mapper and reducer have some built-in data cleaning. We control both the mapper and the reducer, and use an inner join to ensure that each output record appears in both tables.
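A minimal sketch of a reduce-side inner join like the one described above. This is not the actual map.py/reduce.py from HW3; the function names, the 'trip'/'fare' source tags, and the column positions passed in as key_cols are all assumptions for illustration.

```python
from itertools import groupby


def map_record(line, source, key_cols):
    """Return (key, (source, fields)), or None for a truncated row.

    source is a hypothetical tag, 'trip' or 'fare', naming the input file.
    key_cols are the (assumed) column positions of the join key.
    """
    fields = [f.strip() for f in line.split(",")]
    if len(fields) <= max(key_cols):
        return None  # basic cleaning: drop rows missing key columns
    key = tuple(fields[i] for i in key_cols)
    return key, (source, fields)


def reduce_join(pairs):
    """Inner join: yield (key, trip_fields, fare_fields) only for keys in both."""
    # sorted() stands in for the shuffle/sort step between map and reduce
    pairs = sorted(pairs, key=lambda kv: kv[0])
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        trips, fares = [], []
        for _, (source, fields) in group:
            (trips if source == "trip" else fares).append(fields)
        for t in trips:
            for f in fares:
                yield key, t, f
```

Keys that show up in only one of the two inputs produce no output, which is what makes this an inner join.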

Analysis

We are going to measure the change in taxi drop-offs at a specific intersection from 2010 to 2013.

Using a larger regional aggregation (zip code, neighborhood) will remove the geo-spatial granularity. Using intersections still allows for aggregation for analysis but keeps the finer geographic detail. Aggregating by intersection also builds in data cleaning--we eliminate records that do not have a pick-up/drop-off or have a pick-up/drop-off on non-existent streets (i.e., in the river or in the middle of Central Park).

The drop-offs we will be focusing on are those that take place on Thursday, Friday, and Saturday from 8 pm to 2 am. This time window was selected because we wanted to capture young professionals going home after a night on the town. :) We are also assuming that people of limited resources will not take a taxi home.
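A rough sketch of how the evening window could be checked. Two assumptions here: timestamps look like "2013-01-03 21:30:00", and the hours after midnight count toward the previous night (so 1 am Sunday belongs to Saturday night). The function name is made up for illustration.

```python
from datetime import datetime


def in_evening_window(ts):
    """True for Thu/Fri/Sat from 8 pm, plus midnight to 2 am the next morning."""
    dt = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
    wd = dt.weekday()  # Monday == 0 ... Sunday == 6
    if dt.hour >= 20:
        return wd in (3, 4, 5)  # Thursday, Friday, Saturday evening
    if dt.hour < 2:
        return wd in (4, 5, 6)  # early Fri/Sat/Sun = tail of the night before
    return False
```

For example, in_evening_window("2013-01-04 01:15:00") is True because 1:15 am Friday is treated as part of Thursday night.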

We are also going to look at taxi pick-ups between 7am and 10am during business days and do a similar analysis.

Also, we will try to validate our findings by looking at the change in real estate prices within a chosen radius of the intersection. The radius that will be used has not been defined yet.

Required Data: Still need to acquire a csv file with all the lat/lng for all intersections in NYC.

April 7

Acquired the center line of all street segments in a shapefile from https://data.cityofnewyork.us/City-Government/CSCL-PUB-Centerline/exjm-f27b. Parsed the shapefile to pull all lats and longs for every street segment. However, not sure which points are intersections. Attempted approaches: tried only first and last points, only first points, all points for each street segment list, and points that appeared more than once among all points for Manhattan streets.

April 9

Managed to extract the intersections from the NYC street shapefile. Wrote a script that looked at the start and end point of every linestring (street), saved it into a dictionary, and stored the count across the entire file. If the exact lat/lng appeared more than once, we assumed that it was an intersection - meaning it is where the start point of one street meets the end point of another.
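The endpoint-counting idea can be sketched roughly as follows. This is not the original script: it assumes the shapefile has already been parsed into a list of coordinate lists, and the function name is hypothetical.

```python
from collections import Counter


def find_intersections(linestrings):
    """Return endpoints shared by more than one linestring.

    linestrings: iterable of lists of (lng, lat) tuples, one list per street
    segment (assumed input shape).
    """
    counts = Counter()
    for coords in linestrings:
        if len(coords) < 2:
            continue  # a single point is not a segment
        # A set so a closed loop (start == end) counts once per segment;
        # this is a choice made for the sketch, not taken from the source.
        counts.update({coords[0], coords[-1]})
    return {pt for pt, n in counts.items() if n > 1}
```

Note that the match is on exact coordinates and only on endpoints, so a street that ends at the mid-point of another segment is not picked up.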

Currently, we have ~62k intersections across all of NYC. We wrote several scripts to compare results and were off by a small margin.
The number of intersections was also validated against spatial analysis done in ArcMap. An intersection is defined as the location where two or more start or end vertices of lines coincide. This method may exclude "T" intersections, where one road meets the mid-point of another street. Additionally, because "Z" (elevation) is not taken into consideration, false intersections are created where there are flyovers, underpasses, or tunnels; the geometry of the line segments intersects, but in real life there is no intersection at these locations.

We are now writing another map-reduce job that will take the output from our TripFareJoin.output results and append the intersection that is closest to the taxi drop-off. We will be building out kd-trees for this exercise. We hope to then explore ways to see the change in drop-offs over time to any specific intersection across neighborhoods.

Questions about the map-reduce job that will take the output from our TripFareJoin.output results and append the intersection that is closest to the taxi drop-off: Will it append the closest intersection regardless of how close the intersection is to the taxi drop-off location? Will a radius be implemented so taxi drop-off locations that are in the river do not get assigned an intersection? If a radius is assigned, how will overlap be handled, and what happens if a valid drop-off location falls outside of the radius?

Never mind, I figured out why overlaps will not be an issue: to relate each taxi trip between 2010 and 2013 to an intersection, the nearest intersection within a defined radius of a record’s drop-off location will be appended to each taxi record.
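A small pure-Python sketch of a nearest-intersection lookup with a radius cutoff, using a 2-d kd-tree. It is only an illustration: the function names are made up, and it uses plain Euclidean distance on the raw coordinates (an assumption; real lat/lng data would likely need projecting first, with the radius in matching units).

```python
import math


def build_kdtree(points, depth=0):
    """Build a 2-d tree as nested tuples: (point, left, right)."""
    if not points:
        return None
    axis = depth % 2  # alternate between x (0) and y (1)
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return (points[mid],
            build_kdtree(points[:mid], depth + 1),
            build_kdtree(points[mid + 1:], depth + 1))


def nearest_within(tree, target, radius, depth=0, best=None):
    """Return (distance, point) for the nearest point within radius, else None."""
    if tree is None:
        return best
    point, left, right = tree
    d = math.hypot(point[0] - target[0], point[1] - target[1])
    if d <= radius and (best is None or d < best[0]):
        best = (d, point)
    axis = depth % 2
    diff = target[axis] - point[axis]
    near, far = (left, right) if diff < 0 else (right, left)
    best = nearest_within(near, target, radius, depth + 1, best)
    # Cross the splitting line only if a closer point could lie beyond it
    limit = best[0] if best is not None else radius
    if abs(diff) <= limit:
        best = nearest_within(far, target, radius, depth + 1, best)
    return best
```

Each drop-off gets at most one intersection (the nearest one), so overlapping radii never assign a trip twice, and a drop-off with nothing inside the radius returns None and can be dropped.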

April 14

Built new mapper/reducer--moved KD tree to mapper instead of reducer. Structured mapper to use yield as opposed to print b/c yield makes the mapper a generator, which produces records lazily, one at a time. (for more info -- http://www.jeffknupp.com/blog/2013/04/07/improve-your-python-yield-and-generators-explained/)
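A sketch of what a generator-style mapper might look like. The names are hypothetical, the input is assumed to be comma-separated with lng and lat in the first two columns, and locate stands in for the KD tree lookup (it is not a real library call).

```python
import sys


def mapper(lines, locate):
    """Yield (intersection_id, 1) for each record that maps to an intersection.

    locate is a hypothetical callable: (lng, lat) -> intersection id or None.
    """
    for line in lines:
        fields = line.rstrip("\n").split(",")
        try:
            lng, lat = float(fields[0]), float(fields[1])
        except (IndexError, ValueError):
            continue  # skip malformed rows
        intersection = locate(lng, lat)
        if intersection is not None:
            yield intersection, 1


def emit(pairs, out=sys.stdout):
    """Write tab-separated key/value lines, one per pair."""
    for key, value in pairs:
        out.write(f"{key}\t{value}\n")
```

Keeping the writing in emit means the mapper itself can be tested by just iterating over it.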

Ran map/reduce jobs to get evening dropoffs per intersection. Edited mapper to filter for morning pickups -> ran kd tree on morning pickups, counted number of pickups between 7 and 10am on weekday mornings for each intersection.

April 17

Exploratory data analysis for morning pickups and evening dropoffs. Put yearly totals for each intersection into pandas df. Dropped any intersection that had 0 trips for any year. Found the mean, median, and quartiles of the absolute counts.
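The filtering and summary steps can be sketched without pandas, using only the standard library as a stand-in. The dict layout (intersection -> {year: count}), the function names, and the "inclusive" quartile method are assumptions for the example; pandas may compute quartiles slightly differently.

```python
import statistics


def drop_zero_years(counts):
    """Keep only intersections with at least one trip in every year."""
    return {k: v for k, v in counts.items() if all(n > 0 for n in v.values())}


def summarize(values):
    """Mean, median, and quartiles of a list of yearly counts."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {"mean": statistics.mean(values), "q1": q1, "median": q2, "q3": q3}
```

Running summarize once per year on the surviving intersections gives one row of summary statistics per year to compare.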

Morning Pickups: Looking at summary statistics of yearly counts per intersection (absolute values), we saw the data was very right-skewed with a long right tail--many intersections with fewer than 10 pickups while the maximum number of pickups was almost 300,000. We saw a drop in absolute values in 2013 (Uber?). We plotted histograms of log-scaled counts to look at distributions. Morning pickups were approximately bimodal Gaussian/normal for 2010 and 2011. Saw a drastic change in distribution in 2012 and 2013. In 2013, the data was even more skewed, with many more intersections that had very few pickups. Decided that morning pickups may not be the best focus for analysis b/c of the drastic change in patterns and will focus on evening pickups and dropoffs.

Ran map/reduce on evening pickups to get KD Tree and count of pickups per intersection.

Evening Dropoffs: We plotted histograms of log-scaled counts to look at distributions--these were consistent between 2010 and 2013 and fairly normally distributed (slight right tail but we're going to cut off outliers).
