Home

Welcome to the BigScience🌸 Data Sourcing Hackathon!

BigScience🌸 is a year-long open scientific collaboration of 600 researchers from 50 countries and more than 250 institutions who collaborate on creating a very large multilingual neural network language model trained on a very large multilingual text dataset (see here for information about the language choices).

The goal of this hackathon is to document and collect data sources for the BigScience🌸 dataset. We are looking to gather a wide variety of resources that represent different kinds of language use: different regions 🌏🌍🌎, different contexts 🏫🏥🏠, and different audiences 🥼🍿📰. In order to collect as many examples of these variations as possible, we need to look for a variety of data types and formats such as books and formal publications, audio formats including radio and podcasts, and others, in addition to traditional web sources.

Read: This is not (quite) a dataset sprint!
New (10/14): Join the Discord to chat about the event and to meet other participants!
New (10/13): Video Guide! Watch for a walkthrough of adding a resource to the catalogue 📺📚🌸

To get started, take a look at the guide and the form to familiarize yourself with the questions and the kind of information you'll need to submit. You can explore the catalogue visualization in the form to get an idea of what resources have already been submitted and what kind of resources you'd like to submit. If you have questions, you can search the FAQ, either below or using the bar on the right, or open an issue on GitHub. For those looking for inspiration on which language resources to add to the catalogue, the BigScience data sourcing team has done extensive preparatory work referencing leads for several languages in the following document. You can pick any of these (don't forget to look through all the tabs), check that it isn't already in the catalogue using the exploration mode, and get adding!

Finally, to encourage participation and thank you for your efforts, we will have a leaderboard of contributors with T-shirt prizes for people who submit and validate the most resources in each language. We're looking forward to sending those all around the world at the end of the sprint 🌸🌏👕🌍🏆🌎🌸

Get started in three easy steps!

1. Propose a new language resource, pick one from the precompiled list, or grab a HF dataset. Check whether it's already in the catalogue.
2. Go to the form and fill out the information for your entry with help from this FAQ and the full guide.
3. Don't forget to save your work! (After you press the button, the form will let you know if any information is missing.)

Communication channels for the event

We encourage all participants to ask questions, make suggestions, and report issues by submitting and reading through GitHub issues for this repository. Before you submit a new issue or reply to an existing one, please read through our Code of Conduct.

Frequently Asked Questions

Why build a catalogue?

What languages are included in the catalogue?

Privacy Policy for the submitted User Information
