Slingshot’s aim in using curated datasets is to ensure that meaningful data is stored on and retrieved from the Filecoin Network. Use cases don’t need to be complex, and the applications built on them can be proprietary in nature.
A wide variety of public datasets can be used for this challenge; a sampling is shown in the table below.
If you would like to use a dataset that you don't see listed here, please submit an issue to add the dataset to this table. To qualify for Slingshot, a dataset should generally be a public good dataset, be accessible to everyone, and not require any special permissions to access. If you are using your own data that you are willing to make public but that does not have a source URL, please share a link to download it in the Link to Dataset field.
In an effort to continue diversifying the data being onboarded onto the network, the list of qualifying datasets changes over time as participating teams onboard more data onto the network! Datasets that qualified in previous phases of Slingshot and no longer qualify are listed separately below.
Name | Description | Size | Format | URL |
---|---|---|---|---|
CCAFS-Climate Data | High resolution climate data to help assess the impacts of climate change primarily on agriculture. These open access datasets of climate projections will help researchers make climate change impact assessments. | 6.8 TiB | Various | http://www.ccafs-climate.org/ |
ECMWF ERA5 Reanalysis | ERA5 is the fifth generation of ECMWF atmospheric reanalyses of the global climate, and the first reanalysis produced as an operational service. | 7.6 TiB | .nc | https://apps.ecmwf.int/datasets/ |
Genomic Data Commons | Genomic, epigenomic, transcriptomic, and proteomic data from the National Genome Atlas Program | 2.5 PB | JSON | https://portal.gdc.cancer.gov |
Prelinger archives | Rick Prelinger and The Internet Archive hereby offer public domain films from Prelinger Archives to all for free downloading and reuse. | - | video | https://archive.org/details/prelinger?tab=collection |
The Massively Multilingual Image Dataset (MMID) | MMID is a large-scale, massively multilingual dataset of images paired with the words they represent collected at the University of Pennsylvania | 1.8 TB | images | https://registry.opendata.aws/mmid/ |
Genome Aggregation Database | The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators that aggregates and harmonizes both exome and genome data from a wide range of large-scale human sequencing projects. | 35.7 TiB | .gz | https://gnomad.broadinstitute.org/ |
Million Song Dataset | NSF-funded public music dataset for research | 280 GB | - | http://millionsongdataset.com/ |
The Boxy Vehicles Dataset | A large vehicle detection dataset with almost two million annotated vehicles for training and evaluating object detection methods for self-driving cars on freeways. | 1 TB | image | https://boxy-dataset.com/boxy/ |
A2D2 | The Audi Autonomous Driving Dataset (A2D2) to support startups and academic researchers working on autonomous driving. | 1.9 TB | point cloud, image | https://www.a2d2.audi/a2d2/en.html |
KITTI-raw data | Autonomous Driving | 442 GB | point cloud, image | http://www.cvlibs.net/datasets/kitti/raw_data.php |
Waymo | The Waymo Open Dataset is comprised of high resolution sensor data collected by Waymo self-driving cars in a wide variety of conditions. We are releasing this dataset publicly to aid the research community in making advancements in machine perception and self-driving technology. | 1.2 TB | point cloud, image | https://waymo.com/open/ |
COVID-19 Open Research Dataset | An AI challenge with AI2, CZI, MSR, Georgetown, NIH & The White House | 19 GB | JSON | https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge |
National Cancer Institute | Cancer data for analysis | 18.46 TB | JSON | https://portal.gdc.cancer.gov/repository |
Public Blockchain Datasets | Blockchain data from cryptocurrencies Bitcoin, Ethereum, Dogecoin, ZCash, Litecoin, Dash, Bitcoin Cash, Ethereum Classic, Tezos, Hedera Hashgraph, IoTex. | 9 TB | Various | https://github.com/blockchain-etl/public-datasets |
The LibriVox Free Audiobook Collection | LibriVox, founded in 2005, is a community of volunteers from all over the world who record public domain texts: poetry, short stories, whole books, even dramatic works, in many different languages. | 19.1 TB | Mp3, m3u | https://archive.org/details/librivoxaudio |
Cancer Cell Line Encyclopedia (CCLE) | The Cancer Cell Line Encyclopedia (CCLE) project is an effort to conduct a detailed genetic characterization of a large panel of human cancer cell lines. This dataset contains RNA-Seq Aligned Reads, WXS Aligned Reads, and WGS Aligned Reads data. | 20.7 TiB | Various | https://portals.broadinstitute.org/ccle/about |
Free Music Archive | 106,574 tracks from 16,341 artists and 14,854 albums, arranged in a hierarchical taxonomy of 161 genres | 879 GB | MP3 | https://github.com/mdeff/fma |
Open Images Dataset | 9 million URLs to images that have been annotated with labels spanning over 6000 categories | 18 TB | PNG | https://storage.googleapis.com/openimages/web/index.html |
Filecoin Proofs | - | 224 GB | - | https://proofs.filecoin.io/ |
Filecoin Trusted Setup | - | 2.05 TB | - | https://trusted-setup.filecoin.io/ |
openFDA | Open datasets from the US Food and Drug Administration | N/A | JSON | https://open.fda.gov/data/downloads/ |
AVSpeech: Large-scale Audio-Visual Speech Dataset | large-scale audio-visual dataset comprising speech video clips with no interfering background noises | 1.50 TB | N/A | https://academictorrents.com/details/b078815ca447a3e4d17e8a2a34f13183ec5dec41 |
COCO 2017 Dataset | COCO is a large-scale object detection, segmentation, and captioning dataset. | 26 GB | jpg | https://www.kaggle.com/awsaf49/coco-2017-dataset |
PandaSet Dataset | Public large-scale dataset for autonomous driving | 31 GB | Various | https://www.kaggle.com/usharengaraju/pandaset-dataset |
GloVe Reddit Comments | Global Vectors for Word Representation based on Reddit comments | 24 GB | Various | https://www.kaggle.com/leighplt/glove-reddit-comments |
Open Library Data Dump | Records of over 22 million works, 30 million editions and 8 million authors, curated by the Open Library project. | 10 GB | txt | https://openlibrary.org/data/ol_dump_latest.txt.gz |
GHTorrent Project | A scalable, queriable, offline mirror of data offered through the Github REST API. | 18 TB | MySQL | https://ghtorrent.org/ |
Fly Brain Anatomy | Fluorescence images of Drosophila melanogaster driver lines for neuroscience research purposes, stored in formats suitable for rapid searching in the cloud. | 119 TB | images and videos | https://registry.opendata.aws/janelia-flylight/ |
Foldingathome COVID-19 Dataset | Folding@home is a massively distributed computing project that uses biomolecular simulations to investigate the molecular origins of disease and accelerate the discovery of new therapies. | 71 TiB | xtc | https://registry.opendata.aws/foldingathome-covid19/ |
SnpEff | Genomic variant annotations and functional effect prediction toolbox | 2 TiB | vcf | https://docs.microsoft.com/en-us/azure/open-datasets/dataset-snpeff |
Russian Open Speech To Text | A collection of speech samples derived from various audio sources. The dataset contains short audio clips in Russian. | 3 TiB | wav/opus | https://docs.microsoft.com/en-us/azure/open-datasets/dataset-open-speech-text |
TartanAir | AirSim simulation dataset for simultaneous localization and mapping (SLAM) | 3 TiB | png/npy/txt | https://docs.microsoft.com/en-us/azure/open-datasets/dataset-tartanair-simulation |
These datasets have been temporarily disqualified for the current phase of the competition, but may be reinstated in the future.
Name | Description | Size | Format | URL |
---|---|---|---|---|
Sloan Digital Sky Survey | Three dimensional view of the universe | 273 TB | Various | https://www.sdss.org/ |
Flickr Commons | The key goal of The Commons is to share hidden treasures from the world's public photography archives. | 50 TB | jpeg | https://www.flickr.com/commons |
Free Rainbow Tables | The goal of FreeRainbowTables.com is to prove the insecurity of using simple hash routines to protect valuable passwords, and force developers to use more secure methods. | - | Various | https://freerainbowtables.com/ |
Chest X-Ray Images (Pneumonia) | 5,863 images, 2 categories | 2.29 GB | JPEG | https://www.kaggle.com/paultimothymooney/chest-xray-pneumonia |
163 source Dataset | NetEase Open Source Mirror Station | - | iso | https://mirrors.163.com |
Huge Stock Market Dataset | Historical daily prices and volumes of all U.S. stocks and ETFs | 772 MB | CSV | https://www.kaggle.com/borismarjanovic/price-volume-data-for-all-us-stocks-etfs |
Condensed Movies | A large-scale video dataset, featuring clips from movies with detailed captions. | 250 GB | Video | https://www.robots.ox.ac.uk/~vgg/research/condensed-movies/ |
USENET (2005-2011) | Compressed USENET posts | 36 GB | Text | http://www.psych.ualberta.ca/~westburylab/downloads/usenetcorpus.download.html |
Noisy speech database | Used for training speech enhancement algorithms and TTS models | 14 GB | WAV | https://datashare.is.ed.ac.uk/handle/10283/2791 |
NFL play-by-play | The data has three tables: teams, players, and plays. | 2.54 GB | Text | https://www.dolthub.com/repositories/Liquidata/nfl-play-by-play |
NYC Trip Record Data | Trip records with fields capturing pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts. | 267 GB | CSV | https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page |
Docker Images | Docker container images that are published on Docker Hub | 167 TB | images | https://hub.docker.com/ |
Arxiv | Scholarly articles in the fields of physics, mathematics, computer science, quantitative biology, and more. | - | - | https://arxiv.org/ |
Audius | An American decentralized music platform developing the first community-owned and artist-controlled Music sharing protocol. | - | MP3 | https://audius.co/ |
Blackbird Dataset | A large-scale dataset for UAV perception in aggressive flight | 4.79 TB | - | https://academictorrents.com/details/eb542a231dbeb2125e4ec88ddd18841a867c2656 |
ArchLinux | ArchLinux packages repository | 56 GB | Various | https://wiki.archlinux.org/index.php/Mirrors |
CentOS | CentOS packages repository | 200 GB | Various | http://mirror.sesp.northwestern.edu/centos/ |
Tencent Corpus for Chinese Words and Phrases | Meant for AI purposes | 6.3 GB | Various | https://ai.tencent.com/ailab/nlp/en/embedding.html |
R-fMRI Maps Project | Medical data from neurological imaging | - | Various | http://mrirc.psych.ac.cn/RfMRIMaps |
National Palace Museum (Taiwan) | A variety of museum artifacts | - | Various | https://theme.npm.edu.tw/opendata/ |
Project Gutenberg | online library of free eBooks - english | 60 GB | various | https://www.gutenberg.org |
ImageNet | an image database organized according to the WordNet hierarchy | 1.2 TB | jpeg | http://www.image-net.org/ |
IPUMS | Global census data | - | Structured data | https://ipums.org/ |
Udacity Self-Driving Car data | Data used for training self-driving machine learning models | ~285 GB | - | https://github.com/udacity/self-driving-car/tree/master/datasets |
NEAR-VI-Dataset | The NetEase AR Oriented Visual Inertial Dataset | 175 GB | gif | https://github.com/EZXR-Research/NEAR-VI-Dataset |
Top 100 Crypto Investor Dataset | Crypto price and project analytics | 9 GB | Various | https://www.kaggle.com/georgemac510/top-100-crypto-dataset |
Common Voice | Common Voice is Mozilla's initiative to help teach machines how real people speak. | 100 GB | audio | https://commonvoice.mozilla.org/en/datasets |
TAO | TAO is a federated dataset for Tracking Any Object, containing 2,907 high resolution videos, captured in diverse environments, which are half a minute long on average. | 225 GB | video | http://taodataset.org/ |
OTW | The Out the Window (OTW) dataset is a crowdsourced activity dataset containing 5,668 instances of 17 activities from the NIST Activities in Extended Video (ActEV) challenge. | 48 GB | video | https://stresearch.github.io/otw/ |
IMDB-WIKI | IMDB-WIKI – 500k+ face images with age and gender labels | 276 GB | image | https://data.vision.ee.ethz.ch/cvl/rrothe/imdb-wiki/ |
OpenStreetMap | A collaborative project to create a free editable map of the world | 40 GB | JSON | https://console.cloud.google.com/marketplace/product/openstreetmap/geo-openstreetmap?filter=solution-type%3Adataset&filter=category%3Atransportation&id=88e087d0-5f92-4407-8dcc-5577bd06d776 |
Google Open Images | 9 million URLs to images that have been annotated with labels spanning over 6000 categories | 456 GB | image | https://academictorrents.com/details/9e9194e21ce045deee8d811481b4cd676b20b06b |
UC Berkeley Computer Science Courses | An archive of UC Berkeley Computer Science Courses | 446 GB | Video | https://academictorrents.com/details/5e84be34f69b1a313f6dcb51667edf238d5d4412 |
COCO | COCO is a large-scale object detection, segmentation, and captioning dataset. | - | ZIP | https://cocodataset.org |
3000 Rice Genomes Project | An international effort to sequence the genomes of 3,024 rice varieties from 89 countries. | - | BAM, VCF | https://registry.opendata.aws/3kricegenome/ |
Public domain movies | A collection of public domain movies | 1 TB | video | https://archive.org/details/publicmovies212 |
UK Biobank Pan-Ancestry Summary Statistics | A multi-ancestry analysis of 7,221 phenotypes using a generalized mixed model association testing framework, spanning 16,119 genome-wide association studies. | 75.2 TiB | .gz | https://pan.ukbb.broadinstitute.org |
Genome in a Bottle | Several reference genomes to enable translation of whole human genome sequencing to clinical practice. | 89.9 TiB | .gz | http://genomeinabottle.org/ |
MusicNet Dataset | A curated collection of labeled classical music. | 31 GB | wav | https://www.kaggle.com/imsparsh/musicnet-dataset |
CLEVR Dataset | A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning. | 19 GB | png&json | https://www.kaggle.com/timoboz/clevr-dataset |
COCO2017 | Common Objects in Context | 20 GB | jpg | https://www.kaggle.com/aishwr/coco2017 |
Here is the list of open datasets that were onboarded in previous phases that no longer qualify for Slingshot rewards.
Name | Description | Size | Format | URL |
---|---|---|---|---|
Encyclopedia of DNA Elements | The Encyclopedia of DNA Elements (ENCODE) Consortium is an international collaboration of research groups funded by the National Human Genome Research Institute (NHGRI). | 660.7 TiB | gz,bigbed | https://www.encodeproject.org |
Common Crawl | An open repository of web crawl data | 235 TB | WARC | https://commoncrawl.org/ |
Genome Ark | The Genome Ark hosts genomic information for the Vertebrate Genomes Project (VGP) and other related projects. | 245.6 TiB | Various | https://vertebrategenomesproject.org |
Landsat 8 | Multispectral time series satellite imagery of all land on Earth since 2013 | 1.3 PB (approx) | GeoTIFF + metadata - sample scene | https://registry.opendata.aws/landsat-8/#usageexamples |
Linux ISO | Linux ISO Images | - | ISO | https://www.linuxlookup.com/linux_iso |