What languages are included in the catalogue?
While the catalogue is being built as part of the BigScience effort, which focuses on a predefined set of languages, it will continue to be maintained and aims to be of general use beyond the scope of this project. As such, we welcome contributions of entries corresponding to any language.
The BigScience effort chose to focus on the following languages and language groups based on a combination of demographic and geographic coverage, availability, and the first-hand knowledge of BigScience participants. We especially invite contributions of entries corresponding to these languages and language groups:
- African Languages of the Niger-Congo family, including e.g. Swahili and other Bantu languages
- Arabic
- Basque
- Catalan
- Chinese
- English
- French
- Indic languages, including Bengali, Hindi, Urdu
- Indonesian
- Portuguese
- Spanish
- Vietnamese
We also welcome contributions of programming language data to test a large-scale model's ability to learn their distribution.
If you choose African languages, Arabic, or Indic languages for your entry, a further drop-down menu will appear to allow you to select the specific language, language variety, or dialect: the full list can be found here.
We also recommend that you add free-text comments about the language variety whenever possible (for example, language variety information not covered by the above selection), as this will make the catalogue easier to navigate!
If the language or one of the languages corresponding to your entry is absent from the above list, you can bring up a selection menu with a broader selection (all languages that have a BCP-47 code) by checking the Show other languages box in the Languages and Locations section of the form.
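
For readers unfamiliar with BCP-47 codes, the sketch below shows what common tags look like and how a rough shape check might be written. It is only an illustration: the function name `looks_like_simple_bcp47` is hypothetical, the pattern covers only the common `language[-Script][-REGION]` shape rather than the full BCP-47 grammar, and it does not reflect how the catalogue form itself builds its selection menu.

```python
import re

# Simplified shape: language[-Script][-REGION].
# Assumption: this covers only a common subset of BCP-47 tags; it does not
# handle variants, extensions, or private-use subtags, and it does not check
# that the subtags are actually registered.
_SIMPLE_TAG = re.compile(
    r"[A-Za-z]{2,3}"                # primary language subtag, e.g. "sw", "pt"
    r"(?:-[A-Za-z]{4})?"            # optional script subtag, e.g. "Hant"
    r"(?:-(?:[A-Za-z]{2}|[0-9]{3}))?"  # optional region, e.g. "BR" or "419"
)


def looks_like_simple_bcp47(tag: str) -> bool:
    """Return True if `tag` has the simplified language[-Script][-REGION] shape."""
    return _SIMPLE_TAG.fullmatch(tag) is not None


if __name__ == "__main__":
    for candidate in ["sw", "pt-BR", "zh-Hant-TW", "es-419", "pt_BR"]:
        print(candidate, looks_like_simple_bcp47(candidate))
```

Tags such as `sw` (Swahili), `pt-BR` (Portuguese as used in Brazil), and `zh-Hant-TW` (Chinese in Traditional script as used in Taiwan) pass this check, while `pt_BR` does not because BCP-47 separates subtags with hyphens rather than underscores.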