Inject templates for ASR datasets #2565

Merged: 17 commits, Jul 5, 2021

lewtun (Member) commented Jun 29, 2021

This PR adds ASR templates for 5 of the most common speech datasets on the Hub, where "common" is defined by the number of models trained on them.

I also fixed a bunch of the tags in the READMEs 😎

lewtun (Member, Author) commented Jun 29, 2021

Wait until #2567 is merged so we can benefit from the tagger :)

lewtun marked this pull request as ready for review July 1, 2021 13:30

lhoestq (Member) left a comment

Awesome, thank you!
Great work on the tags as well.

A few comments about some language codes that we are missing in our languages list:

Comment on lines 64 to 66
- zh-CN
- zh-HK
- zh-TW
- zh-Hans-CN
- zh-Hans-HK
- zh-Hant-TW

Member:

We should keep zh-CN, zh-HK and zh-TW since they are valid codes according to BCP-47.

The Hans subtag means that the text uses the Simplified Chinese script.

Therefore I'd suggest adding these three to our list of supported languages.
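
To make the subtag structure concrete, here is a minimal sketch of splitting a tag into language, script, region and variant parts. It is not code from this PR or from `datasets`; the names `LanguageTag` and `parse_tag` are hypothetical, and the pattern covers only the simple `language[-script][-region][-variant]` shape (no extlang, extension or private-use subtags). It checks that a tag is well-formed, not that its subtags are actually registered.

```python
import re
from typing import NamedTuple, Optional, Tuple


class LanguageTag(NamedTuple):
    """Hypothetical container for the parts of a simple BCP-47 tag."""
    language: str
    script: Optional[str]
    region: Optional[str]
    variants: Tuple[str, ...]


# Simplified shape: language[-script][-region][-variant]*
# Assumption: extlang, extensions and private-use subtags are out of scope.
_TAG_RE = re.compile(
    r"^(?P<language>[A-Za-z]{2,3})"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?"
    r"(?P<variants>(?:-(?:[A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3}))*)$"
)


def parse_tag(tag: str) -> Optional[LanguageTag]:
    """Return the parts of a well-formed tag, or None if the shape is wrong."""
    match = _TAG_RE.match(tag)
    if match is None:
        return None
    script = match.group("script")
    region = match.group("region")
    # The variants group looks like "-sursilv" or "", so drop empty pieces.
    variants = tuple(v.lower() for v in match.group("variants").split("-") if v)
    return LanguageTag(
        language=match.group("language").lower(),
        script=script.title() if script else None,
        region=region.upper() if region else None,
        variants=variants,
    )


for code in ["zh-CN", "zh-Hans-CN", "zh-Hant-TW", "pa-IN", "rm-sursilv"]:
    print(code, parse_tag(code))
```

Under this sketch, `zh-CN` and `zh-Hans-CN` share the same language and region and differ only in the optional script subtag, which is why both forms can be accepted side by side.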

Member, Author:

sounds good, will add them to datasets!

@@ -43,11 +44,10 @@ languages:
- mt
- nl
- or
- pa-IN

Member:

we could add this language to our list of supported languages

Comment on lines 49 to 50
- rm-sursilv
- rm-vallader

Member:

these are valid BCP-47 codes according to Wikipedia, we can add them to our list of languages

lewtun mentioned this pull request Jul 5, 2021

lewtun (Member, Author) commented Jul 5, 2021

thanks for the feedback @lhoestq! i've added the new language codes and this PR should be ready for a merge :)

lhoestq (Member) left a comment

Thanks ^^

lhoestq merged commit dd3fe1f into huggingface:master Jul 5, 2021
lewtun mentioned this pull request Jul 12, 2021