Curated Datasets for the Slingshot Competition

Slingshot uses curated datasets to ensure that meaningful data is stored on and retrieved from the Filecoin network. The use cases don't need to be complex, and they can be proprietary to a particular application.

A wide variety of public datasets can be used for this challenge; a sampling is shown in the tables below.

If you would like to use a dataset that is not listed here, please submit an issue to add it to this table. To qualify for Slingshot, a dataset should generally be a public good: accessible to everyone, with no special permissions required for access. If you are using your own data that you are willing to make public but that has no source URL, please share a download link in the Link to Dataset field.

Current qualifying datasets

To continue diversifying the data onboarded onto the network, the list of qualifying datasets changes over time as participating teams onboard more data. Datasets that qualified in previous phases of Slingshot and no longer qualify are listed separately below.

Name	Description	Size	Format	URL
CCAFS-Climate Data	High resolution climate data to help assess the impacts of climate change primarily on agriculture. These open access datasets of climate projections will help researchers make climate change impact assessments.	6.8 TiB	Various	http://www.ccafs-climate.org/
ECMWF ERA5 Reanalysis	ERA5 is the fifth generation of ECMWF atmospheric reanalyses of the global climate, and the first reanalysis produced as an operational service.	7.6 TiB	.nc	https://apps.ecmwf.int/datasets/
Genomic Data Commons	Genomic, epigenomic, transcriptomic, and proteomic data from the National Genome Atlas Program	2.5 PB	JSON	https://portal.gdc.cancer.gov
Prelinger archives	Rick Prelinger and The Internet Archive hereby offer public domain films from Prelinger Archives to all for free downloading and reuse.	-	video	https://archive.org/details/prelinger?tab=collection
The Massively Multilingual Image Dataset (MMID)	MMID is a large-scale, massively multilingual dataset of images paired with the words they represent collected at the University of Pennsylvania	1.8 TB	images	https://registry.opendata.aws/mmid/
Genome Aggregation Database	The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators that aggregates and harmonizes both exome and genome data from a wide range of large-scale human sequencing projects.	35.7 TiB	.gz	https://gnomad.broadinstitute.org/
Million Song Dataset	NSF-funded public music dataset for research	280 GB	-	http://millionsongdataset.com/
The Boxy Vehicles Dataset	A large vehicle detection dataset with almost two million annotated vehicles for training and evaluating object detection methods for self-driving cars on freeways.	1 TB	image	https://boxy-dataset.com/boxy/
A2D2	The Audi Autonomous Driving Dataset (A2D2) to support startups and academic researchers working on autonomous driving.	1.9 TB	point cloud, image	https://www.a2d2.audi/a2d2/en.html
KITTI-raw data	Autonomous Driving	442 GB	point cloud, image	http://www.cvlibs.net/datasets/kitti/raw_data.php
Waymo	The Waymo Open Dataset is comprised of high resolution sensor data collected by Waymo self-driving cars in a wide variety of conditions. We are releasing this dataset publicly to aid the research community in making advancements in machine perception and self-driving technology.	1.2 TB	point cloud, image	https://waymo.com/open/
COVID-19 Open Research Dataset	An AI challenge with AI2, CZI, MSR, Georgetown, NIH & The White House	19 GB	JSON	https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge
National Cancer Institute	Cancer data for analysis	18.46 TB	JSON	https://portal.gdc.cancer.gov/repository
Public Blockchain Datasets	Blockchain data from cryptocurrencies Bitcoin, Ethereum, Dogecoin, ZCash, Litecoin, Dash, Bitcoin Cash, Ethereum Classic, Tezos, Hedera Hashgraph, IoTex.	9 TB	Various	https://github.com/blockchain-etl/public-datasets
The LibriVox Free Audiobook Collection	LibriVox, founded in 2005, is a community of volunteers from all over the world who record public domain texts: poetry, short stories, whole books, even dramatic works, in many different languages.	19.1 TB	Mp3, m3u	https://archive.org/details/librivoxaudio
Cancer Cell Line Encyclopedia (CCLE)	The Cancer Cell Line Encyclopedia (CCLE) project is an effort to conduct a detailed genetic characterization of a large panel of human cancer cell lines. This dataset contains RNA-Seq Aligned Reads, WXS Aligned Reads, and WGS Aligned Reads data.	20.7 TiB	Various	https://portals.broadinstitute.org/ccle/about
Free Music Archive	106,574 tracks from 16,341 artists and 14,854 albums, arranged in a hierarchical taxonomy of 161 genres	879 GB	MP3	https://github.com/mdeff/fma
Open Images Dataset	9 million URLs to images that have been annotated with labels spanning over 6000 categories	18 TB	PNG	https://storage.googleapis.com/openimages/web/index.html
Filecoin Proofs	-	224 GB	-	https://proofs.filecoin.io/
Filecoin Trusted Setup	-	2.05 TB	-	https://trusted-setup.filecoin.io/
openFDA	Open datasets from the US Food and Drug Administration	N/A	JSON	https://open.fda.gov/data/downloads/
AVSpeech: Large-scale Audio-Visual Speech Dataset	A large-scale audio-visual dataset comprising speech video clips with no interfering background noise	1.50 TB	N/A	https://academictorrents.com/details/b078815ca447a3e4d17e8a2a34f13183ec5dec41
COCO 2017 Dataset	COCO is a large-scale object detection, segmentation, and captioning dataset.	26 GB	jpg	https://www.kaggle.com/awsaf49/coco-2017-dataset
PandaSet Dataset	Public large-scale dataset for autonomous driving	31 GB	Various	https://www.kaggle.com/usharengaraju/pandaset-dataset
GloVe Reddit Comments	Global Vectors for Word Representation based on Reddit comments	24 GB	Various	https://www.kaggle.com/leighplt/glove-reddit-comments
Open Library Data Dump	Records of over 22 million works, 30 million editions and 8 million authors, curated by the Open Library project.	10 GB	txt	https://openlibrary.org/data/ol_dump_latest.txt.gz
GHTorrent Project	A scalable, queriable, offline mirror of data offered through the Github REST API.	18 TB	MySQL	https://ghtorrent.org/
Fly Brain Anatomy	Fluorescence images of Drosophila melanogaster driver lines for neuroscience research purposes, stored in formats suitable for rapid searching in the cloud.	119 TB	images and videos	https://registry.opendata.aws/janelia-flylight/
Foldingathome COVID-19 Dataset	Folding@home is a massively distributed computing project that uses biomolecular simulations to investigate the molecular origins of disease and accelerate the discovery of new therapies.	71 TiB	xtc	https://registry.opendata.aws/foldingathome-covid19/
SnpEff	Genomic variant annotations and functional effect prediction toolbox	2 TiB	vcf	https://docs.microsoft.com/en-us/azure/open-datasets/dataset-snpeff
Russian Open Speech To Text	A collection of speech samples derived from various audio sources. The dataset contains short audio clips in Russian.	3 TiB	wav/opus	https://docs.microsoft.com/en-us/azure/open-datasets/dataset-open-speech-text
TartanAir	AirSim simulation dataset for simultaneous localization and mapping (SLAM)	3 TiB	png/npy/txt	https://docs.microsoft.com/en-us/azure/open-datasets/dataset-tartanair-simulation

Deprioritized datasets

These datasets have been temporarily disqualified for the current phase of the competition, but may be reinstated in the future.

Name	Description	Size	Format	URL
Sloan Digital Sky Survey	Three dimensional view of the universe	273 TB	Various	https://www.sdss.org/
Flickr Commons	The key goal of The Commons is to share hidden treasures from the world's public photography archives.	50 TB	jpeg	https://www.flickr.com/commons
Free Rainbow Tables	The goal of FreeRainbowTables.com is to prove the insecurity of using simple hash routines to protect valuable passwords, and force developers to use more secure methods.	-	Various	https://freerainbowtables.com/
Chest X-Ray Images (Pneumonia)	5,863 images, 2 categories	2.29 GB	JPEG	https://www.kaggle.com/paultimothymooney/chest-xray-pneumonia
163 source Dataset	NetEase Open Source Mirror Station	-	iso	https://mirrors.163.com
Huge Stock Market Dataset	Historical daily prices and volumes of all U.S. stocks and ETFs	772 MB	CSV	https://www.kaggle.com/borismarjanovic/price-volume-data-for-all-us-stocks-etfs
Condensed Movies	A large-scale video dataset, featuring clips from movies with detailed captions.	250 GB	Video	https://www.robots.ox.ac.uk/~vgg/research/condensed-movies/
USENET (2005-2011)	Compressed USENET posts	36 GB	Text	http://www.psych.ualberta.ca/~westburylab/downloads/usenetcorpus.download.html
Noisy speech database	Used for training speech enhancement algorithms and TTS models	14 GB	WAV	https://datashare.is.ed.ac.uk/handle/10283/2791
NFL play-by-play	The data has three tables: teams, players, and plays.	2.54 GB	Text	https://www.dolthub.com/repositories/Liquidata/nfl-play-by-play
NYC Trip Record Data	Trip records with fields capturing pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts.	267 GB	CSV	https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page
Docker Images	Docker container images that are published on Docker Hub	167 TB	images	https://hub.docker.com/
Arxiv	Scholarly articles in the fields of physics, mathematics, computer science, quantitative biology, and more.	-	PDF	https://arxiv.org/
Audius	An American decentralized music platform developing the first community-owned and artist-controlled Music sharing protocol.	-	MP3	https://audius.co/
Blackbird Dataset	A large-scale dataset for UAV perception in aggressive flight	4.79 TB	-	https://academictorrents.com/details/eb542a231dbeb2125e4ec88ddd18841a867c2656
ArchLinux	ArchLinux packages repository	56 GB	Various	https://wiki.archlinux.org/index.php/Mirrors
CentOS	CentOS packages repository	200 GB	Various	http://mirror.sesp.northwestern.edu/centos/
Tencent Corpus for Chinese Words and Phrases	Meant for AI purposes	6.3 GB	Various	https://ai.tencent.com/ailab/nlp/en/embedding.html
R-fMRI Maps Project	Medical data from neurological imaging	-	Various	http://mrirc.psych.ac.cn/RfMRIMaps
National Palace Museum (Taiwan)	A variety of museum artifacts	-	Various	https://theme.npm.edu.tw/opendata/
Project Gutenberg	Online library of free eBooks (English)	60 GB	Various	https://www.gutenberg.org
ImageNet	An image database organized according to the WordNet hierarchy	1.2 TB	jpeg	http://www.image-net.org/
IPUMS	Global census data	-	Structured data	https://ipums.org/
Udacity Self-Driving Car data	Data used for training self-driving machine learning models	~285 GB	-	https://github.com/udacity/self-driving-car/tree/master/datasets
NEAR-VI-Dataset	The NetEase AR Oriented Visual Inertial Dataset	175 GB	gif	https://github.com/EZXR-Research/NEAR-VI-Dataset
Top 100 Crypto Investor Dataset	Crypto price and project analytics	9 GB	Various	https://www.kaggle.com/georgemac510/top-100-crypto-dataset
Common Voice	Common Voice is Mozilla's initiative to help teach machines how real people speak.	100 GB	audio	https://commonvoice.mozilla.org/en/datasets
TAO	TAO is a federated dataset for Tracking Any Object, containing 2,907 high resolution videos, captured in diverse environments, which are half a minute long on average.	225 GB	video	http://taodataset.org/
OTW	The Out the Window (OTW) dataset is a crowdsourced activity dataset containing 5,668 instances of 17 activities from the NIST Activities in Extended Video (ActEV) challenge.	48 GB	video	https://stresearch.github.io/otw/
IMDB-WIKI	IMDB-WIKI – 500k+ face images with age and gender labels	276 GB	image	https://data.vision.ee.ethz.ch/cvl/rrothe/imdb-wiki/
OpenStreetMap	A collaborative project to create a free editable map of the world	40 GB	JSON	https://console.cloud.google.com/marketplace/product/openstreetmap/geo-openstreetmap?filter=solution-type%3Adataset&filter=category%3Atransportation&id=88e087d0-5f92-4407-8dcc-5577bd06d776
Google Open Images	9 million URLs to images that have been annotated with labels spanning over 6000 categories	456 GB	image	https://academictorrents.com/details/9e9194e21ce045deee8d811481b4cd676b20b06b
UC Berkeley Computer Science Courses	An archive of UC Berkeley Computer Science Courses	446 GB	Video	https://academictorrents.com/details/5e84be34f69b1a313f6dcb51667edf238d5d4412
COCO	COCO is a large-scale object detection, segmentation, and captioning dataset.	-	ZIP	https://cocodataset.org
3000 Rice Genomes Project	An international effort to sequence the genomes of 3,024 rice varieties from 89 countries.	-	BAM, VCF	https://registry.opendata.aws/3kricegenome/
Public domain movies	A collection of public domain movies	1 TB	video	https://archive.org/details/publicmovies212
UK Biobank Pan-Ancestry Summary Statistics	A multi-ancestry analysis of 7,221 phenotypes using a generalized mixed model association testing framework, spanning 16,119 genome-wide association studies.	75.2 TiB	.gz	https://pan.ukbb.broadinstitute.org
Genome in a Bottle	Several reference genomes to enable translation of whole human genome sequencing to clinical practice.	89.9 TiB	.gz	http://genomeinabottle.org/
MusicNet Dataset	A curated collection of labeled classical music.	31 GB	wav	https://www.kaggle.com/imsparsh/musicnet-dataset
CLEVR Dataset	A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning.	19 GB	png&json	https://www.kaggle.com/timoboz/clevr-dataset
COCO2017	Common Objects in Context	20 GB	jpg	https://www.kaggle.com/aishwr/coco2017

Disqualified datasets

Here is the list of open datasets that were onboarded in previous phases that no longer qualify for Slingshot rewards.

Name	Description	Size	Format	URL
Encyclopedia of DNA Elements	The Encyclopedia of DNA Elements (ENCODE) Consortium is an international collaboration of research groups funded by the National Human Genome Research Institute (NHGRI).	660.7 TiB	gz,bigbed	https://www.encodeproject.org
Common Crawl	An open repository of web crawl data	235 TB	WARC	https://commoncrawl.org/
Genome Ark	The Genome Ark hosts genomic information for the Vertebrate Genomes Project (VGP) and other related projects.	245.6 TiB	Various	https://vertebrategenomesproject.org
Landsat 8	Multispectral time series satellite imagery of all land on Earth since 2013	1.3 PB (approx)	GeoTIFF + metadata - sample scene	https://registry.opendata.aws/landsat-8/#usageexamples
Linux ISO	Linux ISO Images	-	ISO	https://www.linuxlookup.com/linux_iso
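
The Size column in the tables above mixes decimal units (GB, TB, PB) with binary units (TiB), so entries are not directly comparable. The following is a minimal sketch of one way to normalize them to bytes when comparing datasets. The function name `size_to_bytes` and the unit table are illustrative assumptions, not part of any Slingshot tooling, and entries without a usable size (such as `-` or `N/A`) are returned as `None`.

```python
import re

# Bytes per unit. Decimal (SI) prefixes are powers of 10;
# binary (IEC) prefixes are powers of 2.
UNIT_BYTES = {
    "MB": 10**6, "GB": 10**9, "TB": 10**12, "PB": 10**15,
    "MiB": 2**20, "GiB": 2**30, "TiB": 2**40, "PiB": 2**50,
}

# Optional "~", a number, optional whitespace, then a unit (e.g. "~285 GB", "2TiB").
SIZE_RE = re.compile(r"~?\s*(\d+(?:\.\d+)?)\s*([A-Za-z]+)")


def size_to_bytes(size):
    """Convert a Size cell to bytes (hypothetical helper).

    Returns None when the cell has no parseable size, e.g. "-" or "N/A".
    Trailing notes such as "(approx)" are ignored.
    """
    match = SIZE_RE.match(size.strip())
    if match is None:
        return None
    value, unit = match.groups()
    factor = UNIT_BYTES.get(unit)
    if factor is None:
        return None
    # round() avoids off-by-one results from floating-point truncation.
    return round(float(value) * factor)


if __name__ == "__main__":
    for cell in ["6.8 TiB", "1.8 TB", "2.5 PB", "~285 GB", "-", "N/A"]:
        print(cell, "->", size_to_bytes(cell))
```

Note that 1 TiB is about 10% larger than 1 TB, so the difference matters when ranking the larger datasets.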