Datasets

The datasets listed below are all preloaded into the HackReduce hadoop clusters and ready for immediate use at the event. The [datasets/*] notice next to each title indicates the path where its located depending on where you want to access it:

Hadoop HDFS: Can be found at /datasets/*
Namenode local filesystem: Can be found at /mnt/datasets/*
HackReduce Github project: Samples found in the datasets/* folder of the project. Note: not all the datasets listed on this page will have samples in the Github project.

There's also the possibility of loading new data at the event, but this process could take a few hours. Please see Greg about loading new data into your clusters.

If you're looking for data sets, below are a few good places to start:

Quora answer wiki for Where can I get large datasets open to the public?
The Data Hub data catalog
The Visua.ly Blog has a great article called 30 Places to Find Open Data on the Web.

If you're looking for Toronto data sets here are a couple of places to start:

Buzzdata has many city-related data sets rounded up here.
Global news's data set of Toronto parking tickets is here, again courtesy of Buzzdata.

Million Song Dataset [datasets/msd]

http://labrosa.ee.columbia.edu/millionsong/pages/getting-dataset

Special thanks to Echo Nest for converting the whole 200+ GB HDF5 format of the dataset to TSV for us

Freebase [datasets/freebase]

Quad dump (http://wiki.freebase.com/wiki/Data_dumps#Quad_dump) [datasets/freebase/quadruples]
Simple topic dump (http://wiki.freebase.com/wiki/Data_dumps#Simple_Topic_Dump) [datasets/freebase/topics]

NASDAQ daily prices and dividends [datasets/nasdaq]

http://www.infochimps.com/datasets/daily-1970-2010-open-close-hi-low-and-volume-nasdaq-exchange

NYSE daily prices and dividends [datasets/nyse]

http://www.infochimps.com/datasets/daily-1970-2010-open-close-hi-low-and-volume-nyse-exchange

Wikipedia XML dump [datasets/wikipedia]

http://en.wikipedia.org/wiki/Wikipedia:Database_download#English-language_Wikipedia

Google Ngram [datasets/ngrams]

Only the 1-gram and 2-gram datasets are available
http://ngrams.googlelabs.com/datasets

International Cancer Genome Consortium [datasets/icgc]

Data format documentation: http://dcc.icgc.org/pages/docs/ICGC_Data_Submission_Manual-0.6b-Unextended.pdf

Geonames [datasets/geonames]

http://download.geonames.org/export/dump/readme.txt

GigaFrEn corpus [datasets/fre-eng]

French: datasets/fre-eng/fre
English: datasets/fre-eng/eng
http://www.statmt.org/wmt09/translation-task.html

Mate1 various data [datasets/mate1]

Provided by the Mate1 team
Includes (with all personally identifiable data excluded, of course):
- Profiles: datasets/mate1/profile
- Iinternal messages: datasets/mate1/internal_message
- Subscriptions: datasets/mate1/subscription
- Who's seen who: datasets/mate1/whos_seen_who
- Hot block list: datasets/mate1/hot_block_list
Take a look at the datasets/mate1/*-cols.txt files for a description of the CSV fields for each dataset.

Reddit voting data [datasets/reddit]

http://www.reddit.com/r/redditdev/comments/dtg4j/want_to_help_reddit_build_a_recommender_a_public/

Bixi Bikesharing Data 2012 [datasets/bixidata]

Includes Bixi (branded differently in other cities) data for Toronto, Ottawa and Boston (Hubway).
Updated by Julia Evans and Kamal Marhubi for HackReduce Montreal 2012
Dataset location on the namenode local filesystem is: /mnt/bixidata
Additional Boston Hubway data sets available from: http://hubwaydatachallenge.org including shapefiles, an aggregated rebalancing data sample, Apr - Sep 2012, and station status data, with available bikes and empty docks per station by minute back to August 2011 (30 million records)

Bixi Montreal [datasets/bixi]

XML dump of all the bike station information queried every minute over a couple of months.
Provided by Fabrice (http://twitter.com/f8full)

DNS dataset [datasets/dns]

Contains the root file with all the domain names and their associated nameservers for the "com" TLD.

LDEO Surface Ocean CO2 Climatology data [datasets/ldeo]

http://www.pmel.noaa.gov/co2/story/LDEO+Surface+Ocean+CO2+Climatology

Twitter dataset [datasets/twitter]

Data of the social graph, user id to names, and selected celebrity profiles. This does not contain actual tweets because of Twitter policies.
http://an.kaist.ac.kr/traces/WWW2010.html

Flight dataset [datasets/flights]

Limited set of flight data containing origin, destination, departure time, return time, price and date.
Only has flights originating from SEA
Provided by Hopper

Amazon dataset [datasets/amazon]

Description of data formats: http://131.193.40.52/data/README.txt
Data listing: http://131.193.40.52/data/

IMDB dataset [datasets/imdb]

ftp://ftp.fu-berlin.de/pub/misc/movies/database/

Taylor Tweets [datasets/taylor_tweets]

Taken around of the time of Elizabeth Taylor's death in late March 2011, this dataset was a search of all tweets containing the word "taylor" in them.
JSON format

Citation networks [datasets/citation-networks]

Arxiv HEP-PH (high energy physics phenomenology) [datasets/citation-networks/hep-ph/{dates,graph}]: http://snap.stanford.edu/data/cit-HepPh.html
Arxiv HEP-TH (high energy physics theory) [datasets/citation-networks/hep-th/{dates,graph}]: http://snap.stanford.edu/data/cit-HepTh.html
U.S. patent dataset: http://snap.stanford.edu/data/cit-Patents.html

Shopify data [datasets/shopify]

All the data for the U.S. specifying origin/destination of orders from our system, including price and date.

Common Crawl web crawl

Web crawl corpus of 5 billion pages (60TB) in ARC file format http://aws.amazon.com/datasets/41740
Divided into three major subsets:

Current Crawl - s3://aws-publicdatasets/common-crawl/parse-output/ - crawl data from 2012
Archived Crawl #1 - s3://aws-publicdatasets/common-crawl/crawl-001/ - crawl data from 2008/2010
Archived Crawl #2 - s3://aws-publicdatasets/common-crawl/crawl-002/ - crawl data from 2009/2010

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Datasets

Datasets

Million Song Dataset [datasets/msd]

Freebase [datasets/freebase]

NASDAQ daily prices and dividends [datasets/nasdaq]

NYSE daily prices and dividends [datasets/nyse]

Wikipedia XML dump [datasets/wikipedia]

Google Ngram [datasets/ngrams]

International Cancer Genome Consortium [datasets/icgc]

Geonames [datasets/geonames]

GigaFrEn corpus [datasets/fre-eng]

Mate1 various data [datasets/mate1]

Reddit voting data [datasets/reddit]

Bixi Bikesharing Data 2012 [datasets/bixidata]

Bixi Montreal [datasets/bixi]

DNS dataset [datasets/dns]

LDEO Surface Ocean CO2 Climatology data [datasets/ldeo]

Twitter dataset [datasets/twitter]

Flight dataset [datasets/flights]

Amazon dataset [datasets/amazon]

IMDB dataset [datasets/imdb]

Taylor Tweets [datasets/taylor_tweets]

Citation networks [datasets/citation-networks]

Shopify data [datasets/shopify]

Common Crawl web crawl

Clone this wiki locally