Multi-database English LVCSR recipe #699
@guoguo12 and I will look into this in the next few weeks :-)
We've started working on it, but Allen has finals this coming week. Is end of summer still OK, or should we expedite a bit?
Nothing urgent. Our previous plan to get the recipe in shape by mid-August still works for us. I just wanted to make sure you had everything you needed from our end.
Yep, we're all set. We'll provide updates here as the project proceeds. My working branch is guoguo12:multi-recipe. There's not much to see there right now. If you'd like, I can open a "WIP" pull request.
@guoguo12 Thanks. A WIP PR is always desirable.
@vijayaditya: When using existing data prep scripts, e.g. the ones from the original per-corpus recipes, should we copy them into the new recipe or reference them in place? @danpovey: Would like your input on this as well.
Regarding copying data-prep scripts vs. linking them... not sure. Maybe Dan can weigh in.
Okay, thanks. The downside is that the copies may need to be manually synced if the originals are changed. That said, @sikoried thought of another solution (Option #3): Allow the user to specify (by commit hash) what version of each script to use, then automatically pull those versions from GitHub using sparse checkout.
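For concreteness, a rough sketch of what Option #3 could look like. The commit hash and the choice of script below are placeholders, and it is assumed that GitHub allows fetching an arbitrary commit by its SHA:

```bash
#!/usr/bin/env bash
# Hypothetical sketch of Option #3: pull one data-prep script, pinned to a
# specific commit, from the upstream Kaldi repo via a sparse checkout.
# The commit hash and script path are placeholders, not real pins.
set -e

repo=https://github.com/kaldi-asr/kaldi.git
commit=0123456789abcdef0123456789abcdef01234567   # placeholder pin
script=egs/swbd/s5c/local/swbd1_data_prep.sh      # script to pull

tmpdir=$(mktemp -d)
git -C "$tmpdir" init -q
git -C "$tmpdir" remote add origin "$repo"
git -C "$tmpdir" config core.sparseCheckout true
echo "$script" > "$tmpdir/.git/info/sparse-checkout"
git -C "$tmpdir" fetch -q --depth 1 origin "$commit"
git -C "$tmpdir" checkout -q FETCH_HEAD
mkdir -p local
cp "$tmpdir/$script" local/
rm -rf "$tmpdir"
```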
I think that's too complicated.
Okay, we'll stick with copying. Thanks!
It is always good to just mention in the comment at the top of the script where it was copied from. We are trying to do this in the new nnet3 scripts in local/. It would be good to follow the same convention here. Vijay
Status update: Finished copying and integrating the data prep scripts for the various corpora. The data directories should be ready to be combined now. I've also normalized the transcripts, and I've created a lexicon by combining the lexicons from fisher_swbd and tedlium. Next, I'll look into generating missing pronunciations using Sequitur G2P, as in the librispeech recipe, and making those part of the lexicon as well.
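For anyone following along, combining the per-corpus data directories is typically done with Kaldi's standard utilities. A minimal sketch, with illustrative directory names:

```bash
# Minimal sketch: merge per-corpus data directories into one training set.
# Directory names are illustrative; utils/combine_data.sh and
# utils/validate_data_dir.sh are the standard Kaldi utilities used here.
. ./path.sh

utils/combine_data.sh data/train_combined \
  data/fisher_swbd/train data/tedlium/train data/librispeech/train

# Sanity-check the merged directory (features not yet extracted).
utils/validate_data_dir.sh --no-feats data/train_combined
```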
I think it might be a better idea to use CMUDict as the upstream lexicon. Recently we tried a combined setup with Librispeech and Switchboard, and ... This will require some mapping. I believe the MSU pronunciations are not ...
So basically, 1) train a G2P model using CMUDict (after stress removal), 2) synthesize pronunciations for all words across all databases that are not in CMUDict, and 3) combine into a single lexicon?
Something like that, but it might possibly be better to use the prons in ...
Okay, I'll give it a shot. Thanks for the feedback!
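A rough sketch of the three steps discussed above, assuming Sequitur G2P. The file names are placeholders, the stress-stripping sed is only a rough pass, and a single training call stands in for Sequitur's usual multi-pass ramp-up:

```bash
# Rough sketch of the plan discussed above; all file names are placeholders.
# 1) Strip stress markers from CMUDict (e.g. AH0 -> AH) to get the training
#    lexicon. Assumes uppercase phones with a trailing 0/1/2 stress digit.
grep -v '^;;;' cmudict.dict | sed -E 's/([A-Z])[0-2]/\1/g' | sort -u > lexicon_nostress.txt

# 2) Train a Sequitur G2P model on the stress-free entries.
g2p.py --train lexicon_nostress.txt --devel 5% --write-model g2p.model.1

# 3) Synthesize prons for words (from any corpus) that are not in CMUDict,
#    then merge them with the CMUDict prons into a single lexicon.
g2p.py --model g2p.model.1 --apply oov_words.txt > oov_lexicon.txt
cat lexicon_nostress.txt oov_lexicon.txt | sort -u > lexicon_combined.txt
```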
Just for clarity, this is correct. Our main objective was to add a decent LM to TEDLIUM. That needed ... Tony
Hi,
@vince62s: Yep, @sikoried and I are taking that into consideration. Our current plan is to make the training steps as generic as possible. Our proposed data directory structure keeps a separate data directory for each corpus. This approach will make it easy to omit corpora if you don't have them or to add corpora if you have additional data. It will also make it easy to evaluate how different combinations of data affect the results, e.g. "Does using 50% WSJ and 50% SWBD for the initial monophone model work better than using just WSJ?" We think this solution strikes a good balance between hard-coding too much and over-automating too much. Suggestions are appreciated!
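To make the "50% WSJ and 50% SWBD" kind of experiment concrete, here is one way it could be set up with standard Kaldi utilities. The directory names and utterance count are illustrative:

```bash
# Illustrative only: build a monophone training set from a 50/50 mix of two
# corpora using standard Kaldi utilities (names and counts are made up).
. ./path.sh

n=15000   # utterances to take from each corpus
utils/subset_data_dir.sh --shortest data/wsj/train  $n data/wsj_subset
utils/subset_data_dir.sh --shortest data/swbd/train $n data/swbd_subset
utils/combine_data.sh data/mono_train data/wsj_subset data/swbd_subset
```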
Writing the data prep scripts to handle multiple corpora is good, but keep ... So I would recommend the following: write all your data prep scripts so ... --Vijay
Guys, I don't think tuning it for every possible combination is a good use of time.
Yeah, we didn't intend to optimize parameters for more than one combination. Korbinian
Yeah, OK, but do not over-engineer it.
Agreed RE not over-engineering it. Scripts with millions of options don't help anyone.
Sure guys, no worries ;-) It's all much simpler than you may be picturing right now!
I think uppercase vs. lowercase doesn't matter; just choose either.
I've been using lowercase phones without stress markers. I didn't use the MSU dictionary. It seems that fewer than 2% of the words (not unique words, but actual words) used in the Switchboard transcripts are absent from CMUdict. Many of these are partial words with hyphens, like "th-" or "tha-".
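For reference, that kind of OOV token rate can be estimated with a couple of shell commands. A sketch, assuming a Kaldi-style text file and a word list derived from CMUdict (file names are placeholders, and consistent casing between transcripts and dictionary is assumed):

```bash
# Sketch: token-level OOV rate of the Switchboard transcripts against CMUdict.
# data/swbd/train/text is assumed to be in Kaldi format: "utt-id w1 w2 ...".
cut -d' ' -f1 lexicon_nostress.txt | sort -u > cmudict_words.txt
cut -d' ' -f2- data/swbd/train/text | tr ' ' '\n' | grep -v '^$' > tokens.txt

# First pass loads the dictionary words; second pass counts OOV tokens.
awk 'NR==FNR{dict[$1]=1; next}
     {total++; if (!($1 in dict)) oov++}
     END{printf "OOV tokens: %d / %d (%.2f%%)\n", oov, total, 100*oov/total}' \
  cmudict_words.txt tokens.txt
```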
Addressed in #771.
@vijayaditya Did anyone actually run the nnet3 and/or chain recipes on this multi-database setup?
Not yet. Vijay
Sorry, I haven't.
OK, no worries. Thanks for the quick reply!
Hey Allen, long time no talk. I'd personally be curious if anyone tries nnet3 or chain on this recipe. I get the feeling there's quite a bit of untapped potential here.
Daniel Galvez
Some of us at Hopkins are working on this recipe and I think we're going to run the chain recipe soon. I'll try to check in the relevant changes soon.
Hi, just wanted to check if nnet3 or chain have been run on this recipe.
I'm running experiments and will hopefully commit the recipe soon.
I want to combine LibriSpeech, TEDLIUM, and Common Voice. In fact, it would be good if the creators of this recipe could provide it.
Sorry for commenting on this out-of-date issue.
We would like to design a recipe which combines multiple English LVCSR tasks for acoustic model (AM) and language model (LM) training.
This is an advanced task which requires experience with the data preparation stage, lexicon creation, and LM training.
This task would involve evaluating the system on the Hub5 2000 eval set, the RT03 eval set, the Librispeech test set, the AMI (SDM/MDM/IHM) eval sets, and the Tedlium test set.
The AM will be built using the chain (lattice-free MMI) objective function. The LM will be built using the RNN-LM toolkit.