Inject templates for ASR datasets #2565
This reverts commit daf34ea.
Wait until #2567 is merged so we can benefit from the tagger :)
Awesome, thank you!
Great work for the tags as well
A few comments about some language codes that we are missing in our languages list:
datasets/common_voice/README.md
Outdated
- zh-CN
- zh-HK
- zh-TW
- zh-Hans-CN
- zh-Hans-HK
- zh-Hant-TW
We should keep zh-CN, zh-HK and zh-TW since they are valid codes according to BCP-47
The Hans subtag means that the text uses the Simplified Chinese script.
Therefore I'd suggest adding these three to our list of supported languages.
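For reference, here is a rough sketch of how these tags break down into language, script, region and variant subtags. It is only an illustration: `split_tag` is a hypothetical helper, not something in `datasets`, and the pattern covers just the simple `language[-script][-region][-variant]` shape rather than the full BCP-47 grammar.

```python
import re

# Simplified pattern (assumption: NOT a complete BCP-47 parser).
# language: 2-3 letters, script: 4 letters, region: 2 letters or 3 digits,
# variants: 5-8 alphanumerics, or 4 characters starting with a digit.
_TAG_RE = re.compile(
    r"^(?P<language>[A-Za-z]{2,3})"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-(?P<region>[A-Za-z]{2}|\d{3}))?"
    r"(?P<variants>(?:-(?:[A-Za-z0-9]{5,8}|\d[A-Za-z0-9]{3}))*)$"
)


def split_tag(tag):
    """Split a simple language tag into its subtags, or return None if it does not fit."""
    match = _TAG_RE.match(tag)
    if match is None:
        return None
    script = match.group("script")
    region = match.group("region")
    return {
        "language": match.group("language").lower(),
        # Conventional casing: script is title case, region is upper case.
        "script": script.title() if script else None,
        "region": region.upper() if region else None,
        "variants": [v for v in match.group("variants").split("-") if v],
    }


for tag in ["zh-CN", "zh-Hans-CN", "zh-Hant-TW", "pa-IN", "rm-sursilv"]:
    print(tag, split_tag(tag))
```

With this sketch, `zh-CN` has only a region subtag, while `zh-Hans-CN` carries the extra `Hans` script subtag on top of the same language and region.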
sounds good, will add them to datasets!
datasets/common_voice/README.md
Outdated
@@ -43,11 +44,10 @@ languages:
- mt
- nl
- or
- pa-IN
we could add this language to our list of supported languages
datasets/common_voice/README.md
Outdated
- rm-sursilv
- rm-vallader
these are valid BCP-47 codes according to Wikipedia, we can add them to our list of languages
thanks for the feedback @lhoestq! I've added the new language codes and this PR should be ready for a merge :)
Thanks ^^
This PR adds ASR templates for 5 of the most common speech datasets on the Hub, where "common" is defined by the number of models trained on them.
I also fixed a bunch of the tags in the READMEs 😎