Yacine Jernite edited this page Oct 14, 2021 · 32 revisions

Welcome to the BigScience🌸 Data Sourcing Hackathon!

BigScience🌸 is a year-long open scientific collaboration of 600 researchers from 50 countries and more than 250 institutions, working together to create a very large multilingual neural network language model trained on a very large multilingual text dataset (see here for information about the language choices).

The goal of this hackathon is to document and collect data sources for the BigScience🌸 dataset. We are looking to gather a wide variety of resources that represent different kinds of language use: different regions 🌏🌍🌎, different contexts 🏫🏥🏠, and different audiences 🥼🍿📰. In order to collect as many examples of these variations as possible, we need to look for a variety of data types and formats such as books and formal publications, audio formats including radio and podcasts, and others, in addition to traditional web sources.

To get started, take a look at the guide and the form to familiarize yourself with the questions and the kind of information you'll need to submit. You can explore the catalogue visualization in the form to get an idea of what resources have already been submitted and what kind of resources you'd like to submit. If you have questions, you can browse the FAQ below, search it using the bar on the right, or post an issue on GitHub. For those looking for inspiration on what language resources to add to the catalogue, the BigScience data sourcing team has done extensive preparatory work compiling leads for several languages in the following document. You can pick any of these (don't forget to look through all the tabs), check that it isn't already in the catalogue using the exploration mode, and get adding!

Finally, to encourage participation and thank you for your efforts, we will have a leaderboard of contributors with T-shirt prizes for people who submit and validate the most resources in each language - we're looking forward to sending those all around the world at the end of the sprint 🌸🌏👕🌍🏆🌎🌸

Get started in three easy steps!

  1. Propose a new language resource, pick one from the precompiled list, or grab a HF dataset. Check whether it's already in the catalogue.
  2. Go to the form and fill out the information for your entry with help from this FAQ and the full guide.
  3. Don't forget to save your work! (After you press the button, the form will let you know if there's some information missing)

Communication channels for the event

We encourage all participants to ask questions, make suggestions, and report issues by submitting and reading through GitHub issues for this repository. Before you submit a new issue or reply to an existing one, please read through our Code of Conduct.

Frequently Asked Questions

Why build a catalogue?

What languages are included in the catalogue?

Privacy Policy for the submitted User Information

Categories

What are the types of resources I can add?

Primary source or language organization?

Content

What's PII?

How do I find the license for my resource?

Functionality

How do I add a primary source?

How do I add a processed dataset?

How do I add a language organization or advocate?

How do I explore the current catalogue?

How do I validate an existing entry?

Other

What are the boxes with the pluses on the form for?

When I switch modes, the text formatting gets messed up. How do I fix that?