Multi-database English LVCSR recipe #699

Closed
vijayaditya opened this issue Apr 14, 2016 · 40 comments

@vijayaditya
Contributor

We would like to design a recipe which combines

  1. fisher+swbd (2100 hours)
  2. tedlium (120 hr or 200 hr)
  3. librispeech (1000 hr)
  4. AMI (100 hr * 8 distant microphone + 100 hr close talk microphone = 900 hr)
  5. WSJ (80 hr)
    tasks for the acoustic model (AM) and language model (LM) training.

This is an advanced task which requires experience with data preparation stage, lexicon creation and LM training.

This task would involve evaluating the system on the Hub5 2000 eval set, the RT03 eval set, the Librispeech test set, the AMI (SDM/MDM/IHM) eval sets and the Tedlium test set.

The AM will be built using the chain (lattice-free MMI) objective function. The LM will be built using the RNN-LM toolkit.

vijayaditya added the "enhancement" and "help wanted" labels on Apr 14, 2016
@sikoried
Contributor

@guoguo12 and I will look into this in the next weeks :-)

vijayaditya removed the "help wanted" label on Apr 14, 2016
@vijayaditya
Contributor Author

@guoguo12 @sikoried Could you please provide timely updates on your progress? This would help us schedule the other projects which rely on this recipe.

@sikoried
Contributor

sikoried commented May 9, 2016

We've started working on it, but Allen has finals the upcoming week. Is end of summer still ok, or should we expedite a bit?

@vijayaditya
Contributor Author

Nothing urgent. Our previous plan to get the recipe in shape by mid-August is still good for our plans. I just wanted to ensure that you had everything needed from our end.

@guoguo12
Contributor

guoguo12 commented May 9, 2016

Yep, we're all set. We'll provide updates here as the project proceeds.

My working branch is guoguo12:multi-recipe. There's not much to see there right now. If you'd like, I can make a "WIP" pull request.

@vijayaditya
Contributor Author

@guoguo12 Thanks. A WIP PR is always desirable.

@guoguo12
Contributor

guoguo12 commented May 20, 2016

@vijayaditya: When using existing data prep scripts, e.g. egs/fisher_swbd/s5/local/swbd1_data_prep.sh, should we 1) symlink/reference the script from our recipe or 2) make a copy of the script in our recipe's directory?

@danpovey: Would like your input on this as well.

@danpovey
Contributor

Regarding copying data-prep scripts vs. linking them... not sure. Maybe copy them -- linking things within local isn't really the normal pattern.

Dan

@guoguo12
Contributor

Okay, thanks. The downside is that the copies may need to be manually synced if the originals are changed. That said, @sikoried thought of another solution (Option #3): Allow the user to specify (by commit hash) what version of each script to use, then automatically pull those versions from GitHub using sparse checkout.

@danpovey
Contributor

I think that's too complicated.
Bear in mind that if the upstream scripts get changed, they may be changed
in ways that are incompatible with the recipe you develop. So it may be
safer to force the manual syncing.
Dan

@guoguo12
Contributor

Okay, we'll stick with copying. Thanks!

@vijayaditya
Contributor Author

It is always good to mention, in a comment at the top of the script, where it was copied from (along with the commit id) and what was changed compared to the original script.

We are trying to do this in the new nnet3 scripts in local/. It would be good to follow this for data prep scripts too.
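
For example, a header roughly like the following records the provenance (the source path is real, but the commit hash and the change notes here are only placeholders):

    #!/usr/bin/env bash
    # Copied from egs/fisher_swbd/s5/local/swbd1_data_prep.sh
    # at commit 0123abc (placeholder hash, for illustration only).
    # Changes compared to the original:
    #   - paths adjusted to the multi-corpus layout
    #   - transcript normalization unified with the other corpora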

Vijay

@guoguo12
Contributor

Status update: Finished copying and integrating the data prep scripts for the various corpora. The data directories should be ready to be combined now. I've also normalized the transcripts, and I've created a lexicon by combining the lexicons from fisher_swbd and tedlium. Next, I'll look into generating missing pronunciations in librispeech using Sequitur G2P and making those part of the lexicon as well.
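
For anyone following along, the merge itself is mostly case normalization plus de-duplication; a rough sketch (the dict directory names below are placeholders, not necessarily what is in my branch):

    mkdir -p data/local/dict_combined
    # Assumed input format: <word> <phone> <phone> ... per line.
    cat data/local/dict_fisher_swbd/lexicon.txt \
        data/local/dict_tedlium/lexicon.txt \
      | tr '[:upper:]' '[:lower:]' \
      | sort -u > data/local/dict_combined/lexicon.txt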

@danpovey
Contributor

I think it might be a better idea to use CMUDict as the upstream lexicon
for the combined data, maybe using the Switchboard (MSU) and Cantab
lexicons, suitably mapped, for some unseen words. Of course you can try
both ways.

Recently we tried a combined setup with Librispeech and Switchboard, and
found the pronunciations obtained by g2p after training on the Switchboard
lexicon were worse than the CMUDict-derived pronunciations.

This will require some mapping. I believe the MSU pronunciations are not
quite the same as what you get from removing stress from CMUDict. Samuel
mentioned that one of the phones is spelled differently. It looks to me
like the Cantab lexicon is an extension of CMUDict (after stress removal).

@guoguo12
Contributor

So basically, 1) train a G2P model using CMUDict (after stress removal), 2) synthesize pronunciations for all words across all databases that are not in CMUDict, and 3) combine into a single lexicon?
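
Concretely, something like the following Sequitur workflow is what I have in mind (file names and the number of ramp-up passes are placeholders):

    # 1) Drop CMUDict comment lines and strip the numeric stress markers from
    #    the pronunciation fields only (so alternate-pron markers like "WORD(2)"
    #    in the first field are left alone).
    grep -v '^;;;' cmudict.0.7a \
      | awk '{ printf "%s", $1;
               for (i = 2; i <= NF; i++) { gsub(/[0-9]/, "", $i); printf " %s", $i }
               printf "\n" }' > cmudict_nostress.txt

    # 2) Train a Sequitur G2P model on the stress-stripped lexicon
    #    (more ramp-up passes generally help; two models are shown for brevity).
    g2p.py --train cmudict_nostress.txt --devel 5% --write-model g2p_model_1
    g2p.py --model g2p_model_1 --ramp-up --train cmudict_nostress.txt \
           --devel 5% --write-model g2p_model_2

    # 3) Apply it to the words not covered by CMUDict (case must match the
    #    training lexicon) and merge everything into one lexicon.
    g2p.py --model g2p_model_2 --apply oov_words.txt > oov_lexicon.txt
    cat cmudict_nostress.txt oov_lexicon.txt | sort -u > lexicon.txt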

@danpovey
Contributor

Something like that, but it might possibly be better to use the prons in
the cantab dictionary and in the MSU dictionary before you go to g2p, so
only use g2p as a last resort. Before you do that, though, you'd need to
work out how to map MSU pronunciations to a CMUDict-like format.

@guoguo12
Contributor

Okay, I'll give it a shot. Thanks for the feedback!

@drTonyRobinson

On 26/05/16 02:40, Daniel Povey wrote:

It looks to me like the Cantab lexicon is an extension of CMUDict
(after stress removal).

Just for clarity, this is correct.

Our main objective was to add a decent LM to TEDLIUM. That needed
pronunciations to run and IIRC at the time Kaldi wasn't using g2p.

Tony

@vince62s
Contributor

Hi,
For users who do not have either the computing resources or access to some of the corpora, I would suggest leaving the choice of which sources to include open as a preliminary step, unless that is outside the goal of this project.

@guoguo12
Contributor

guoguo12 commented Jun 1, 2016

@vince62s: Yep, @sikoried and I are taking that into consideration. Our current plan is to make the training steps as generic as possible. Our proposed data directory structure is:

data/
  multi_a/
    train_s1/  # data directory for stage 1 of training
    train_s2/  # data directory for stage 2 of training
    train_s3/  # data directory for stage 3 of training, etc.
  ...

Each train_s* directory will be configurable in terms of what corpora to include, what ratios of those corpora to use, and perhaps even what model parameters to use. We will provide an example build script that generates multi_a using the "suggested" approach (e.g. WSJ for stages 1-3, then WSJ+Fisher+SWBD for stage 4, etc.). But the key is that, by modifying the build script, you'll be able to build your own multi_b directory that has the same structure as multi_a but different combinations of corpora at each step. And then we'll make the training steps independent of what corpora are being used: they will simply use multi_b instead of multi_a, as long as you specify to do so in run.sh.

This approach will make it easy to omit corpora if you don't have them or add corpora if you have additional data. It will also make it easy to evaluate how different combinations of data affect the results, e.g. "Does using 50% WSJ and 50% SWBD for the initial monophone model work better than using just WSJ?"

We think this solution strikes a good balance between hard-coding too much and over-automating too much. Suggestions are appreciated!
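
To make that concrete, here is a rough sketch of what one stage of the multi_a build script might look like, using the standard Kaldi helpers; the corpus directory names and the subset size are illustrative only:

    #!/usr/bin/env bash
    # Stage 1: a small, clean subset (here WSJ only) for bootstrap training.
    utils/subset_data_dir.sh data/wsj/train 10000 data/multi_a/train_s1

    # Stage 4: pool several corpora into a single training directory.
    utils/combine_data.sh data/multi_a/train_s4 \
      data/wsj/train data/fisher_swbd/train data/tedlium/train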

@vijayaditya
Contributor Author

Writing the data prep scripts to handle multiple corpora is good, but keep in mind that things like model size (number of layers, size of layers, number of leaves, ...), number of epochs of training and many other hyper-parameters are usually adjusted based on data size. So it would be a bit tough to make these configurable while keeping the scripts simple to read. The scripts we keep in local/ are preferably hard-coded, as they are provided as example scripts to be copied and modified, not as callable scripts.

So I would recommend the following: write all your data prep scripts so that they can handle any combination of databases, and provide multiple run_*.sh scripts, each of which operates on a particular database combination and has all the hyper-parameters tuned for that particular data size.

--Vijay

@jtrmal
Contributor

jtrmal commented Jun 1, 2016

Guys, I don't think tuning it for every possible combination is a good use of our and Allen & Korbinian's time. I'd say let's just make the scripts reasonably general and set up the infrastructure for the original corpora as outlined by Vijay. I don't actually think it will be easy to make it work for all of them right away. We can then distribute the models through openslr if there is enough interest (so people can use the models even without having bought the corpora) -- but let's not try to cater to everyone's selection of the corpora they have; that's not our mission.
y.

@sikoried
Contributor

sikoried commented Jun 1, 2016

Yeah, we didn't intend to optimize parameters for more than one combination, but rather to architect the scripts so that it's fairly easy for someone else to make modifications. To give a negative example: right now in other recipes, which partitions and params (states, Gaussians, jobs, and the partitions themselves) are used for the bootstrap is fairly scattered and hard-coded. If you want to change something, you need to carefully go through a long script and not miss a thing. With the abstraction of partitions for stages and a tidier, more organized bootstrap training script, it'll be easy for someone else to change the list of included datasets and make the corresponding parameter changes.

Korbinian.

@jtrmal
Contributor

jtrmal commented Jun 1, 2016

Yeah, Ok, but do not over-engineer it.
Y

@danpovey
Contributor

danpovey commented Jun 1, 2016

Agreed RE not over-engineering it. Scripts with millions of options don't generally make it easier to modify things; they make it harder.
Dan

@sikoried
Contributor

sikoried commented Jun 1, 2016

Sure guys, no worries ;-) It's all way simpler than you may have pictured now!

@danpovey
Contributor

danpovey commented Jun 2, 2016

I think uppercase vs lowercase doesn't matter, just choose either.
In AY vs AY1 AY2, that's a choice of whether you retain the numeric stress
markers in CMUDict. You may want to experiment with both options-- except
that if the recipe will require the MSU dictionary to help cover the
Switchboard data, then you may have to omit stress because the MSU
dictionary doesn't have it.
What I am more concerned about is there may be other incompatibilities
specific to some phones, e.g. I was told by @xiaohui-zhang that the MSU
dictionary spells a particular phone differently, IIRC. After doing the
initial conversion for case and stress removal, just look for words that
have different prons in the 2 dictionaries, you may find it.
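
A quick way to surface such mismatches, assuming both lexicons are already case- and stress-normalized and use a single space between fields (file names are placeholders):

    # Print MSU entries (word + pron) whose word also occurs in the CMU-derived
    # lexicon but whose exact pronunciation line does not, i.e. disagreements.
    awk 'NR==FNR { seen[$0] = 1; words[$1] = 1; next }
         ($1 in words) && !($0 in seen)' \
      lex_cmu_nostress.txt lex_msu_mapped.txt | sort -u > pron_mismatches.txt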

On Thu, Jun 2, 2016 at 8:04 AM, vince62s wrote:

@guoguo12 not trying to be a pain :), but looking forward, so asking another question re: lexicon and LM. I saw Dan's comment on mapping the various lexicons [as a matter of fact, even Librispeech and Tedlium, which I think are both derived from CMUdict, do not have the same phonemes, e.g. AY vs. AY1 AY2], some of them being UPPERCASE, others lowercase (same for the LMs). Is there one specific choice on all of this?

@guoguo12
Contributor

guoguo12 commented Jun 3, 2016

I've been using lowercase phones without stress markers. I didn't use the MSU dictionary. It seems that fewer than 2% of the words used in the Switchboard transcripts (counting tokens, not unique words) are absent from CMUdict. Many of these are partial words with hyphens, like "th-" or "tha-".
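
For reference, that kind of OOV token count can be computed with something like the following (the data and dict paths are placeholders):

    # Fraction of transcript tokens (fields 2..NF of the Kaldi text file)
    # that are missing from the lexicon.
    awk 'NR==FNR { lex[$1] = 1; next }
         { for (i = 2; i <= NF; i++) { total++; if (!($i in lex)) oov++ } }
         END { printf "OOV tokens: %d / %d (%.2f%%)\n", oov, total, 100*oov/total }' \
      data/local/dict/lexicon.txt data/swbd_train/text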

@vijayaditya
Contributor Author

vijayaditya commented Sep 16, 2016

addressed in #771

@vince62s
Contributor

vince62s commented Oct 6, 2016

@vijayaditya Did someone actually run the nnet3 and/or chain recipes on this multi-database setup?

@vijayaditya
Contributor Author

Not yet.

Vijay

@migueljette

I'm curious, one year later: has anybody run this through the nnet3 or chain models, @guoguo12 or @sikoried?

@guoguo12
Contributor

Sorry, I haven't.

@migueljette

Ok, no worries. Thanks for the quick reply!

@galv
Contributor

galv commented Sep 17, 2017 via email

@danpovey
Contributor

danpovey commented Sep 17, 2017 via email

@ananthnagaraj

Hi,
Just wanted to check if nnet3 or chain has been run on this recipe.

@xiaohui-zhang
Contributor

xiaohui-zhang commented Mar 24, 2018 via email

@viju2008

viju2008 commented Sep 8, 2019

I want to combine LibriSpeech, TEDLIUM and Common Voice. Can modifying the script help me achieve this?

In fact, it would be good if the creators of this recipe could provide it.

@JiayiFu

JiayiFu commented Jul 14, 2020

Sorry for commenting on this out-of-date issue.
Just wanted to check: are there any updated results for the chain recipe? @viju2008 @xiaohui-zhang Did you do some experiments?
