Curated Datasets for the Slingshot Competition

Slingshot uses curated datasets to ensure that meaningful data is stored on and retrieved from the Filecoin network. Use cases don't need to be complex, and the applications can be proprietary in nature.

A wide variety of public datasets can be leveraged for this challenge; a sampling is shown in the tables below.

If you would like to use a dataset that you don't see listed here, please submit an issue to add the dataset to this table. To qualify for Slingshot, a dataset should generally be a public good dataset, be accessible to everyone, and not require any special permissions to access. If you are using your own data that you are willing to make public but that does not have a source URL, please share a link to download it in the Link to Dataset field.

Current qualifying datasets

In an effort to continue diversifying the data being onboarded onto the network, the list of qualifying datasets changes over time as participating teams onboard more data. Datasets that qualified in previous phases of Slingshot and no longer qualify are listed separately below.

| Name | Description | Size | Format | URL |
| --- | --- | --- | --- | --- |
| CCAFS-Climate Data | High resolution climate data to help assess the impacts of climate change primarily on agriculture. These open access datasets of climate projections will help researchers make climate change impact assessments. | 6.8 TiB | Various | http://www.ccafs-climate.org/ |
| ECMWF ERA5 Reanalysis | ERA5 is the fifth generation of ECMWF atmospheric reanalyses of the global climate, and the first reanalysis produced as an operational service. | 7.6 TiB | .nc | https://apps.ecmwf.int/datasets/ |
| Genomic Data Commons | Genomic, epigenomic, transcriptomic, and proteomic data from the National Genome Atlas Program | 2.5 PB | JSON | https://portal.gdc.cancer.gov |
| Prelinger archives | Rick Prelinger and The Internet Archive hereby offer public domain films from Prelinger Archives to all for free downloading and reuse. | - | video | https://archive.org/details/prelinger?tab=collection |
| The Massively Multilingual Image Dataset (MMID) | MMID is a large-scale, massively multilingual dataset of images paired with the words they represent, collected at the University of Pennsylvania | 1.8 TB | images | https://registry.opendata.aws/mmid/ |
| Genome Aggregation Database | The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators that aggregates and harmonizes both exome and genome data from a wide range of large-scale human sequencing projects. | 35.7 TiB | .gz | https://gnomad.broadinstitute.org/ |
| Million Song Dataset | NSF-funded public music dataset for research | 280 GB | - | http://millionsongdataset.com/ |
| The Boxy Vehicles Dataset | A large vehicle detection dataset with almost two million annotated vehicles for training and evaluating object detection methods for self-driving cars on freeways. | 1 TB | image | https://boxy-dataset.com/boxy/ |
| A2D2 | The Audi Autonomous Driving Dataset (A2D2) to support startups and academic researchers working on autonomous driving. | 1.9 TB | point cloud, image | https://www.a2d2.audi/a2d2/en.html |
| KITTI-raw data | Autonomous Driving | 442 GB | point cloud, image | http://www.cvlibs.net/datasets/kitti/raw_data.php |
| Waymo | The Waymo Open Dataset is comprised of high resolution sensor data collected by Waymo self-driving cars in a wide variety of conditions. We are releasing this dataset publicly to aid the research community in making advancements in machine perception and self-driving technology. | 1.2 TB | point cloud, image | https://waymo.com/open/ |
| COVID-19 Open Research Dataset | An AI challenge with AI2, CZI, MSR, Georgetown, NIH & The White House | 19 GB | JSON | https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge |
| National Cancer Institute | Cancer data for analysis | 18.46 TB | JSON | https://portal.gdc.cancer.gov/repository |
| Public Blockchain Datasets | Blockchain data from cryptocurrencies Bitcoin, Ethereum, Dogecoin, ZCash, Litecoin, Dash, Bitcoin Cash, Ethereum Classic, Tezos, Hedera Hashgraph, IoTex. | 9 TB | Various | https://github.com/blockchain-etl/public-datasets |
| The LibriVox Free Audiobook Collection | LibriVox - founded in 2005 - is a community of volunteers from all over the world who record public domain texts: poetry, short stories, whole books, even dramatic works, in many different languages | 19.1 TB | MP3, M3U | https://archive.org/details/librivoxaudio |
| Cancer Cell Line Encyclopedia (CCLE) | The Cancer Cell Line Encyclopedia (CCLE) project is an effort to conduct a detailed genetic characterization of a large panel of human cancer cell lines. This dataset contains RNA-Seq Aligned Reads, WXS Aligned Reads, and WGS Aligned Reads data. | 20.7 TiB | Various | https://portals.broadinstitute.org/ccle/about |
| Free Music Archive | 106,574 tracks from 16,341 artists and 14,854 albums, arranged in a hierarchical taxonomy of 161 genres | 879 GB | MP3 | https://github.com/mdeff/fma |
| Open Images Dataset | 9 million URLs to images that have been annotated with labels spanning over 6000 categories | 18 TB | PNG | https://storage.googleapis.com/openimages/web/index.html |
| Filecoin Proofs | - | 224 GB | - | https://proofs.filecoin.io/ |
| Filecoin Trusted Setup | - | 2.05 TB | - | https://trusted-setup.filecoin.io/ |
| openFDA | Open datasets from the US Food and Drug Administration | N/A | JSON | https://open.fda.gov/data/downloads/ |
| AVSpeech: Large-scale Audio-Visual Speech Dataset | Large-scale audio-visual dataset comprising speech video clips with no interfering background noises | 1.50 TB | N/A | https://academictorrents.com/details/b078815ca447a3e4d17e8a2a34f13183ec5dec41 |
| COCO 2017 Dataset | COCO is a large-scale object detection, segmentation, and captioning dataset. | 26 GB | jpg | https://www.kaggle.com/awsaf49/coco-2017-dataset |
| PandaSet Dataset | Public large-scale dataset for autonomous driving | 31 GB | Various | https://www.kaggle.com/usharengaraju/pandaset-dataset |
| GloVe Reddit Comments | Global Vectors for Word Representation based on Reddit comments | 24 GB | Various | https://www.kaggle.com/leighplt/glove-reddit-comments |
| Open Library Data Dump | Records of over 22 million works, 30 million editions and 8 million authors, curated by the Open Library project. | 10 GB | txt | https://openlibrary.org/data/ol_dump_latest.txt.gz |
| GHTorrent Project | A scalable, queriable, offline mirror of data offered through the GitHub REST API. | 18 TB | MySQL | https://ghtorrent.org/ |
| Fly Brain Anatomy | Fluorescence images of Drosophila melanogaster driver lines for neuroscience research purposes, stored in formats suitable for rapid searching in the cloud. | 119 TB | images and videos | https://registry.opendata.aws/janelia-flylight/ |
| Foldingathome COVID-19 Dataset | Folding@home is a massively distributed computing project that uses biomolecular simulations to investigate the molecular origins of disease and accelerate the discovery of new therapies. | 71 TiB | xtc | https://registry.opendata.aws/foldingathome-covid19/ |
| SnpEff | Genomic variant annotations and functional effect prediction toolbox | 2 TiB | vcf | https://docs.microsoft.com/en-us/azure/open-datasets/dataset-snpeff |
| Russian Open Speech To Text | A collection of speech samples derived from various audio sources. The dataset contains short audio clips in Russian. | 3 TiB | wav/opus | https://docs.microsoft.com/en-us/azure/open-datasets/dataset-open-speech-text |
| TartanAir | AirSim simulation dataset for simultaneous localization and mapping (SLAM) | 3 TiB | png/npy/txt | https://docs.microsoft.com/en-us/azure/open-datasets/dataset-tartanair-simulation |

Deprioritized datasets

These datasets have been temporarily disqualified for the current phase of the competition, but may be reinstated in the future.

| Name | Description | Size | Format | URL |
| --- | --- | --- | --- | --- |
| Sloan Digital Sky Survey | Three dimensional view of the universe | 273 TB | Various | https://www.sdss.org/ |
| Flickr Commons | The key goal of The Commons is to share hidden treasures from the world's public photography archives. | 50 TB | jpeg | https://www.flickr.com/commons |
| Free Rainbow Tables | The goal of FreeRainbowTables.com is to prove the insecurity of using simple hash routines to protect valuable passwords, and force developers to use more secure methods. | - | Various | https://freerainbowtables.com/ |
| Chest X-Ray Images (Pneumonia) | 5,863 images, 2 categories | 2.29 GB | JPEG | https://www.kaggle.com/paultimothymooney/chest-xray-pneumonia |
| 163 source Dataset | NetEase Open Source Mirror Station | - | iso | https://mirrors.163.com |
| Huge Stock Market Dataset | Historical daily prices and volumes of all U.S. stocks and ETFs | 772 MB | CSV | https://www.kaggle.com/borismarjanovic/price-volume-data-for-all-us-stocks-etfs |
| Condensed Movies | A large-scale video dataset, featuring clips from movies with detailed captions. | 250 GB | Video | https://www.robots.ox.ac.uk/~vgg/research/condensed-movies/ |
| USENET (2005-2011) | Compressed USENET posts | 36 GB | Text | http://www.psych.ualberta.ca/~westburylab/downloads/usenetcorpus.download.html |
| Noisy speech database | Used for training speech enhancement algorithms and TTS models | 14 GB | WAV | https://datashare.is.ed.ac.uk/handle/10283/2791 |
| NFL play-by-play | The data has three tables: teams, players, and plays. | 2.54 GB | Text | https://www.dolthub.com/repositories/Liquidata/nfl-play-by-play |
| NYC Trip Record Data | Includes fields capturing pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts. | 267 GB | CSV | https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page |
| Docker Images | Docker container images that are published on Docker Hub | 167 TB | images | https://hub.docker.com/ |
| Arxiv | Scholarly articles in the fields of physics, mathematics, computer science, quantitative biology, and more. | - | PDF | https://arxiv.org/ |
| Audius | An American decentralized music platform developing the first community-owned and artist-controlled music sharing protocol. | - | MP3 | https://audius.co/ |
| Blackbird Dataset | A large-scale dataset for UAV perception in aggressive flight | 4.79 TB | - | https://academictorrents.com/details/eb542a231dbeb2125e4ec88ddd18841a867c2656 |
| ArchLinux | ArchLinux packages repository | 56 GB | Various | https://wiki.archlinux.org/index.php/Mirrors |
| CentOS | CentOS packages repository | 200 GB | Various | http://mirror.sesp.northwestern.edu/centos/ |
| Tencent Corpus for Chinese Words and Phrases | Meant for AI purposes | 6.3 GB | Various | https://ai.tencent.com/ailab/nlp/en/embedding.html |
| R-fMRI Maps Project | Medical data from neurological imaging | - | Various | http://mrirc.psych.ac.cn/RfMRIMaps |
| National Palace Museum (Taiwan) | A variety of museum artifacts | - | Various | https://theme.npm.edu.tw/opendata/ |
| Project Gutenberg | Online library of free eBooks (English) | 60 GB | Various | https://www.gutenberg.org |
| ImageNet | An image database organized according to the WordNet hierarchy | 1.2 TB | jpeg | http://www.image-net.org/ |
| IPUMS | Global census data | - | Structured data | https://ipums.org/ |
| Udacity Self-Driving Car data | Data used for training self-driving machine learning models | ~285 GB | - | https://github.com/udacity/self-driving-car/tree/master/datasets |
| NEAR-VI-Dataset | The NetEase AR Oriented Visual Inertial Dataset | 175 GB | gif | https://github.com/EZXR-Research/NEAR-VI-Dataset |
| Top 100 Crypto Investor Dataset | Crypto price and project analytics | 9 GB | Various | https://www.kaggle.com/georgemac510/top-100-crypto-dataset |
| Common Voice | Common Voice is Mozilla's initiative to help teach machines how real people speak. | 100 GB | audio | https://commonvoice.mozilla.org/en/datasets |
| TAO | TAO is a federated dataset for Tracking Any Object, containing 2,907 high resolution videos, captured in diverse environments, which are half a minute long on average. | 225 GB | video | http://taodataset.org/ |
| OTW | The Out the Window (OTW) dataset is a crowdsourced activity dataset containing 5,668 instances of 17 activities from the NIST Activities in Extended Video (ActEV) challenge. | 48 GB | video | https://stresearch.github.io/otw/ |
| IMDB-WIKI | 500k+ face images with age and gender labels | 276 GB | image | https://data.vision.ee.ethz.ch/cvl/rrothe/imdb-wiki/ |
| OpenStreetMap | A collaborative project to create a free editable map of the world | 40 GB | JSON | https://console.cloud.google.com/marketplace/product/openstreetmap/geo-openstreetmap?filter=solution-type%3Adataset&filter=category%3Atransportation&id=88e087d0-5f92-4407-8dcc-5577bd06d776 |
| Google Open Images | 9 million URLs to images that have been annotated with labels spanning over 6000 categories | 456 GB | image | https://academictorrents.com/details/9e9194e21ce045deee8d811481b4cd676b20b06b |
| UC Berkeley Computer Science Courses | An archive of UC Berkeley Computer Science Courses | 446 GB | Video | https://academictorrents.com/details/5e84be34f69b1a313f6dcb51667edf238d5d4412 |
| COCO | COCO is a large-scale object detection, segmentation, and captioning dataset. | - | ZIP | https://cocodataset.org |
| 3000 Rice Genomes Project | An international effort to sequence the genomes of 3,024 rice varieties from 89 countries. | - | BAM, VCF | https://registry.opendata.aws/3kricegenome/ |
| Public domain movies | A collection of public domain movies | 1 TB | video | https://archive.org/details/publicmovies212 |
| UK Biobank Pan-Ancestry Summary Statistics | A multi-ancestry analysis of 7,221 phenotypes using a generalized mixed model association testing framework, spanning 16,119 genome-wide association studies. | 75.2 TiB | .gz | https://pan.ukbb.broadinstitute.org |
| Genome in a Bottle | Several reference genomes to enable translation of whole human genome sequencing to clinical practice. | 89.9 TiB | .gz | http://genomeinabottle.org/ |
| MusicNet Dataset | A curated collection of labeled classical music. | 31 GB | wav | https://www.kaggle.com/imsparsh/musicnet-dataset |
| CLEVR Dataset | A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning. | 19 GB | png&json | https://www.kaggle.com/timoboz/clevr-dataset |
| COCO2017 | Common Objects in Context | 20 GB | jpg | https://www.kaggle.com/aishwr/coco2017 |

Disqualified datasets

The following open datasets were onboarded in previous phases and no longer qualify for Slingshot rewards.

| Name | Description | Size | Format | URL |
| --- | --- | --- | --- | --- |
| Encyclopedia of DNA Elements | The Encyclopedia of DNA Elements (ENCODE) Consortium is an international collaboration of research groups funded by the National Human Genome Research Institute (NHGRI). | 660.7 TiB | gz,bigbed | https://www.encodeproject.org |
| Common Crawl | An open repository of web crawl data | 235 TB | WARC | https://commoncrawl.org/ |
| Genome Ark | The Genome Ark hosts genomic information for the Vertebrate Genomes Project (VGP) and other related projects. | 245.6 TiB | Various | https://vertebrategenomesproject.org |
| Landsat 8 | Multispectral time series satellite imagery of all land on Earth since 2013 | 1.3 PB (approx) | GeoTIFF + metadata - sample scene | https://registry.opendata.aws/landsat-8/#usageexamples |
| Linux ISO | Linux ISO Images | - | ISO | https://www.linuxlookup.com/linux_iso |