This repository contains the tools and code used by the BigScience initiative to build the ROOTS dataset, which was used to train the BLOOM models, as well as a reduced version of the dataset used to train the tokenizer.
More detail on the process, including the specifics of the cleaning, filtering, and deduplication operations, can be found in Section 2, "(Crowd)Sourcing a Language Resource Catalogue", and Section 3, "Processing OSCAR", of the ROOTS paper, cited below:
```bibtex
@inproceedings{bigscience-roots:2022,
    title={The BigScience {ROOTS} Corpus: A 1.6{TB} Composite Multilingual Dataset},
    author={Hugo Lauren{\c{c}}on and Lucile Saulnier and Thomas Wang and Christopher Akiki and Albert Villanova del Moral and Teven Le Scao and Leandro Von Werra and Chenghao Mou and Eduardo Gonz{\'a}lez Ponferrada and Huu Nguyen and J{\"o}rg Frohberg and Mario {\v{S}}a{\v{s}}ko and Quentin Lhoest and Angelina McMillan-Major and G{\'e}rard Dupont and Stella Biderman and Anna Rogers and Loubna Ben allal and Francesco De Toni and Giada Pistilli and Olivier Nguyen and Somaieh Nikpoor and Maraim Masoud and Pierre Colombo and Javier de la Rosa and Paulo Villegas and Tristan Thrush and Shayne Longpre and Sebastian Nagel and Leon Weber and Manuel Romero Mu{\~n}oz and Jian Zhu and Daniel Van Strien and Zaid Alyafeai and Khalid Almubarak and Vu Minh Chien and Itziar Gonzalez-Dios and Aitor Soroa and Kyle Lo and Manan Dey and Pedro Ortiz Suarez and Aaron Gokaslan and Shamik Bose and David Ifeoluwa Adelani and Long Phan and Hieu Tran and Ian Yu and Suhas Pai and Jenny Chim and Violette Lepercq and Suzana Ilic and Margaret Mitchell and Sasha Luccioni and Yacine Jernite},
    booktitle={Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
    year={2022},
    url={https://openreview.net/forum?id=UoEw6KigkUn}
}
```