Added names of less-studied languages #4880

BenjaminGalliot · 2022-08-23T19:32:38Z

Added names of less-studied languages (nru – Narua and jya – Japhug) for existing datasets.

albertvillanova

Thanks for the addition, @BenjaminGalliot.

Just a suggested fix below.

src/datasets/utils/resources/languages.json

Added names of less studied languages (with their Glottolog codes) for existing datasets: Yongning Na (yong1288) and Japhug (japh1234).

albertvillanova

As pointed out in my comment below, currently we are using IANA language codes, not Glottolog codes.

Also note that in each dataset card, besides the `language` tag (validated against this file, `languages.json`), users can use other tags to give further details about the language:

- `language_bcp47`: to list BCP 47 language tags, plus any of the allowed suffixes (script, region, variant, ...)
- `language_details`: to give further details
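
As a rough illustration of the validation mentioned above, here is a minimal sketch that checks a card's `language` tags against a code-to-name mapping. It assumes `languages.json` maps language codes to names. The function name `validate_language_tags`, the sample mapping, and the sample card metadata are all hypothetical and are not the library's actual API.

```python
import json

# Hypothetical excerpt, assuming languages.json maps language codes to names.
LANGUAGES_JSON = '{"en": "English", "jya": "Japhug", "nru": "Narua"}'


def validate_language_tags(tags, known_languages):
    """Split tags into those found in the mapping and those missing from it."""
    valid = [tag for tag in tags if tag in known_languages]
    unknown = [tag for tag in tags if tag not in known_languages]
    return valid, unknown


known = json.loads(LANGUAGES_JSON)

# Hypothetical dataset card metadata using the three tags discussed above.
card_metadata = {
    # A Glottolog code is included on purpose to show it being rejected.
    "language": ["jya", "nru", "yong1288"],
    # Generic BCP 47 example (language plus region suffix).
    "language_bcp47": ["en-US"],
    # Free text, so Glottolog codes can be recorded here instead.
    "language_details": "Yongning Na (Glottolog: yong1288), Japhug (Glottolog: japh1234)",
}

valid, unknown = validate_language_tags(card_metadata["language"], known)
print("valid:", valid)
print("unknown:", unknown)
```

In this sketch the Glottolog code ends up in the unknown list, which mirrors the point of the review: only codes present in the file pass as `language` values, while other identifiers belong in `language_details`.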

src/datasets/utils/resources/languages.json

BenjaminGalliot · 2022-08-24T11:46:40Z

OK, I removed the Glottolog codes and added only the ISO 639-3 ones. For the moment, the Glottolog codes appear in the corpus card description, the language details, and the subcorpus names.

albertvillanova

Thanks.

HuggingFaceDocBuilderDev · 2022-08-24T12:17:16Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

alexis-michaud mentioned this pull request Aug 23, 2022

Language names and language codes: connecting to a big database (rather than slow enrichment of custom list) #4881

Open

albertvillanova requested changes Aug 24, 2022

src/datasets/utils/resources/languages.json (outdated)

BenjaminGalliot force-pushed the patch-1 branch from c130481 to 0b1128a on August 24, 2022 10:01

Added names of less-studied languages.

747376b

Added names of less studied languages (with their Glottolog codes) for existing datasets: Yongning Na (yong1288) and Japhug (japh1234).

BenjaminGalliot force-pushed the patch-1 branch from 0b1128a to 747376b on August 24, 2022 10:20

albertvillanova requested changes Aug 24, 2022

src/datasets/utils/resources/languages.json (outdated)

src/datasets/utils/resources/languages.json

src/datasets/utils/resources/languages.json (outdated)

Removed Glottolog codes (under discussion).

50542df

albertvillanova approved these changes Aug 24, 2022

albertvillanova merged commit a454bc9 into huggingface:main Aug 24, 2022

alexis-michaud mentioned this pull request Sep 13, 2022

For HuggingFace: indicating closest ISO 639-3 code? glottolog/glottolog-cldf#13

Closed
