Multi-database English LVCSR recipe #699

Closed
vijayaditya opened this issue Apr 14, 2016 · 40 comments

@vijayaditya
Contributor

We would like to design a recipe which combines

  1. fisher+swbd (2100 hours)
  2. tedlium (120 hr or 200 hr)
  3. librispeech (1000 hr)
  4. AMI (100 hr * 8 distant microphone + 100 hr close talk microphone = 900 hr)
  5. WSJ (80 hr)
    tasks for the acoustic model (AM) and language model (LM) training.

This is an advanced task which requires experience with data preparation stage, lexicon creation and LM training.

This task would involve evaluating the system on the Hub5 2000 eval set, the RT03 eval set, the Librispeech test set, the AMI (SDM/MDM/IHM) eval sets and the Tedlium test set.

The AM will be built using the chain (lattice-free MMI) objective function. The LM will be built using the RNN-LM toolkit.

vijayaditya added the "enhancement" and "help wanted" labels on Apr 14, 2016
@sikoried
Contributor

@guoguo12 and I will look into this in the next weeks :-)

vijayaditya removed the "help wanted" label on Apr 14, 2016
@vijayaditya
Contributor Author

@guoguo12 @sikoried Could you please provide timely updates on your progress? This would help us schedule the other projects which rely on this recipe.

@sikoried
Contributor

sikoried commented May 9, 2016

We've started working on it, but Allen has finals the upcoming week. Is end of summer still ok, or should we expedite a bit?

@vijayaditya
Contributor Author

Nothing urgent. Our previous plan to get the recipe in shape by mid-August is still good for our plans. I just wanted to ensure that you had everything needed from our end.

@guoguo12
Contributor

guoguo12 commented May 9, 2016

Yep, we're all set. We'll provide updates here as the project proceeds.

My working branch is guoguo12:multi-recipe. There's not much to see there right now. If you'd like, I can make a "WIP" pull request.

@vijayaditya
Contributor Author

@guoguo12 Thanks. A WIP PR is always desirable.

@guoguo12
Contributor

guoguo12 commented May 20, 2016

@vijayaditya: When using existing data prep scripts, e.g. egs/fisher_swbd/s5/local/swbd1_data_prep.sh, should we 1) symlink/reference the script from our recipe or 2) make a copy of the script in our recipe's directory?

@danpovey: Would like your input on this as well.

@danpovey
Contributor

Regarding copying data-prep scripts vs. linking them... not sure. Maybe copy them -- linking things within local isn't really the normal pattern.

Dan

@guoguo12
Contributor

Okay, thanks. The downside is that the copies may need to be manually synced if the originals are changed. That said, @sikoried thought of another solution (Option #3): Allow the user to specify (by commit hash) what version of each script to use, then automatically pull those versions from GitHub using sparse checkout.

@danpovey
Contributor

I think that's too complicated.
Bear in mind that if the upstream scripts get changed, they may be changed
in ways that are incompatible with the recipe you develop. So it may be
safer to force the manual syncing.
Dan

@guoguo12
Contributor

Okay, we'll stick with copying. Thanks!

@vijayaditya
Contributor Author

It is always good to mention, in a comment at the top of the script, where it was copied from (along with the commit id) and what was changed compared to the original script.

We are trying to do this in the new nnet3 scripts in local/. It would be good to follow this for data prep scripts too.
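
For example, a header roughly like the following records the provenance (the source path is real, but the commit hash and the change notes here are only placeholders):

    #!/usr/bin/env bash
    # Copied from egs/fisher_swbd/s5/local/swbd1_data_prep.sh
    # at commit 0123abc (placeholder hash, for illustration only).
    # Changes compared to the original:
    #   - paths adjusted to the multi-corpus layout
    #   - transcript normalization unified with the other corpora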

Vijay

@guoguo12
Contributor

Status update: Finished copying and integrating the data prep scripts for the various corpora. The data directories should be ready to be combined now. I've also normalized the transcripts, and I've created a lexicon by combining the lexicons from fisher_swbd and tedlium. Next, I'll look into generating missing pronunciations in librispeech using Sequitur G2P and making those part of the lexicon as well.
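
For anyone following along, the merge itself is mostly case normalization plus de-duplication; a rough sketch (the dict directory names below are placeholders, not necessarily what is in my branch):

    mkdir -p data/local/dict_combined
    # Assumed input format: <word> <phone> <phone> ... per line.
    cat data/local/dict_fisher_swbd/lexicon.txt \
        data/local/dict_tedlium/lexicon.txt \
      | tr '[:upper:]' '[:lower:]' \
      | sort -u > data/local/dict_combined/lexicon.txt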

@danpovey
Contributor

I think it might be a better idea to use CMUDict as the upstream lexicon
for the combined data, maybe using the Switchboard (MSU) and Cantab
lexicons, suitably mapped, for some unseen words. Of course you can try
both ways.

Recently we tried a combined setup with Librispeech and Switchboard, and
found the pronunciations obtained by g2p after training on the Switchboard
lexicon were worse than the CMUDict-derived pronunciations.

This will require some mapping. I believe the MSU pronunciations are not
quite the same as what you get from removing stress from CMUDict. Samuel
mentioned that one of the phones is spelled differently. It looks to me
like the Cantab lexicon is an extension of CMUDict (after stress removal).

@guoguo12
Contributor

So basically, 1) train a G2P model using CMUDict (after stress removal), 2) synthesize pronunciations for all words across all databases that are not in CMUDict, and 3) combine into a single lexicon?
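
Concretely, something like the following Sequitur workflow is what I have in mind (file names and the number of ramp-up passes are placeholders):

    # 1) Drop CMUDict comment lines and strip the numeric stress markers from
    #    the pronunciation fields only (so alternate-pron markers like "WORD(2)"
    #    in the first field are left alone).
    grep -v '^;;;' cmudict.0.7a \
      | awk '{ printf "%s", $1;
               for (i = 2; i <= NF; i++) { gsub(/[0-9]/, "", $i); printf " %s", $i }
               printf "\n" }' > cmudict_nostress.txt

    # 2) Train a Sequitur G2P model on the stress-stripped lexicon
    #    (more ramp-up passes generally help; two models are shown for brevity).
    g2p.py --train cmudict_nostress.txt --devel 5% --write-model g2p_model_1
    g2p.py --model g2p_model_1 --ramp-up --train cmudict_nostress.txt \
           --devel 5% --write-model g2p_model_2

    # 3) Apply it to the words not covered by CMUDict (case must match the
    #    training lexicon) and merge everything into one lexicon.
    g2p.py --model g2p_model_2 --apply oov_words.txt > oov_lexicon.txt
    cat cmudict_nostress.txt oov_lexicon.txt | sort -u > lexicon.txt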

@danpovey
Contributor

Something like that, but it might possibly be better to use the prons in
the cantab dictionary and in the MSU dictionary before you go to g2p, so
only use g2p as a last resort. Before you do that, though, you'd need to
work out how to map MSU pronunciations to a CMUDict-like format.

@guoguo12
Contributor

Okay, I'll give it a shot. Thanks for the feedback!

@drTonyRobinson

On 26/05/16 02:40, Daniel Povey wrote:

It looks to me like the Cantab lexicon is an extension of CMUDict
(after stress removal).

Just for clarity, this is correct.

Our main objective was to add a decent LM to TEDLIUM. That needed
pronunciations to run and IIRC at the time Kaldi wasn't using g2p.

Tony

@vince62s
Contributor

Hi,
For users who do not have either the computing resources or access to some of the corpora, I would suggest leaving the choice of which sources to include open as a preliminary step, unless that is outside the goal of this project.

@guoguo12
Contributor

guoguo12 commented Jun 1, 2016

@vince62s: Yep, @sikoried and I are taking that into consideration. Our current plan is to make the training steps as generic as possible. Our proposed data directory structure is:

data/
  multi_a/
    train_s1/  # data directory for stage 1 of training
    train_s2/  # data directory for stage 2 of training
    train_s3/  # data directory for stage 3 of training, etc.
  ...

Each train_s* directory will be configurable in terms of what corpora to include, what ratios of those corpora to use, and perhaps even what model parameters to use. We will provide an example build script that generates multi_a using the "suggested" approach (e.g. WSJ for stages 1-3, then WSJ+Fisher+SWBD for stage 4, etc.). But the key is that, by modifying the build script, you'll be able to build your own multi_b directory that has the same structure as multi_a but different combinations of corpora at each step. And then we'll make the training steps independent of what corpora are being used: they will simply use multi_b instead of multi_a, as long as you specify to do so in run.sh.

This approach will make it easy to omit corpora if you don't have them or add corpora if you have additional data. It will also make it easy to evaluate how different combinations of data affect the results, e.g. "Does using 50% WSJ and 50% SWBD for the initial monophone model work better than using just WSJ?"

We think this solution strikes a good balance between hard-coding too much and over-automating too much. Suggestions are appreciated!
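
To make that concrete, here is a rough sketch of what one stage of the multi_a build script might look like, using the standard Kaldi helpers; the corpus directory names and the subset size are illustrative only:

    #!/usr/bin/env bash
    # Stage 1: a small, clean subset (here WSJ only) for bootstrap training.
    utils/subset_data_dir.sh data/wsj/train 10000 data/multi_a/train_s1

    # Stage 4: pool several corpora into a single training directory.
    utils/combine_data.sh data/multi_a/train_s4 \
      data/wsj/train data/fisher_swbd/train data/tedlium/train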

@vijayaditya
Contributor Author

Writing the data prep scripts to handle multiple corpora is good, but keep in mind that things like model size (number of layers, size of layers, number of leaves, ...), number of epochs of training and many other hyper-parameters are usually adjusted based on data size. So it would be a bit tough to make these configurable while keeping the scripts simple to read. The scripts we keep in local/ are preferably hard-coded, as they are provided as example scripts to be copied and modified, not as callable scripts.

So I would recommend the following: write all your data prep scripts so that they can handle any combination of databases, and provide multiple run_*.sh scripts, each of which operates on a particular database combination and has all the hyper-parameters tuned for that particular data size.

--Vijay

@jtrmal
Contributor

jtrmal commented Jun 1, 2016

Guys, I don't think tuning it for every possible combination is a good use of our and Allen & Korbinian's time. I'd say let's just make the scripts reasonably general and set up the infrastructure for the original corpora as outlined by Vijay. I don't actually think it will be easy to make it work for all of them right away. We can then distribute the models through openslr if there is enough interest (so people can use the models even without having bought the corpora) -- but let's not try to cater to everyone's selection of the corpora they have; that's not our mission.
y.

@sikoried
Contributor

sikoried commented Jun 1, 2016

Yeah, we didn't intend to optimize parameters for more than one combination, but rather to architect the scripts so that it's fairly easy for someone else to make modifications. To give a negative example: right now in other recipes, which partitions and params (states, Gaussians, jobs, and the partitions themselves) are used for the bootstrap is fairly scattered and hard-coded. If you want to change something, you need to carefully go through a long script and not miss a thing. With the abstraction of partitions for stages and a tidier, more organized bootstrap training script, it'll be easy for someone else to change the list of included datasets and make the corresponding parameter changes.

Korbinian.

@jtrmal
Contributor

jtrmal commented Jun 1, 2016

Yeah, Ok, but do not over-engineer it.
Y

@danpovey
Contributor

danpovey commented Jun 1, 2016

Agreed RE not over-engineering it. Scripts with millions of options don't generally make it easier to modify things; they make it harder.
Dan

@sikoried
Contributor

sikoried commented Jun 1, 2016

Sure guys, no worries ;-) It's all way simpler than you may have pictured now!

@danpovey
Contributor

danpovey commented Jun 2, 2016

I think uppercase vs lowercase doesn't matter, just choose either.
In AY vs AY1 AY2, that's a choice of whether you retain the numeric stress
markers in CMUDict. You may want to experiment with both options-- except
that if the recipe will require the MSU dictionary to help cover the
Switchboard data, then you may have to omit stress because the MSU
dictionary doesn't have it.
What I am more concerned about is there may be other incompatibilities
specific to some phones, e.g. I was told by @xiaohui-zhang that the MSU
dictionary spells a particular phone differently, IIRC. After doing the
initial conversion for case and stress removal, just look for words that
have different prons in the 2 dictionaries, you may find it.
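
A quick way to surface such mismatches, assuming both lexicons are already case- and stress-normalized and use a single space between fields (file names are placeholders):

    # Print MSU entries (word + pron) whose word also occurs in the CMU-derived
    # lexicon but whose exact pronunciation line does not, i.e. disagreements.
    awk 'NR==FNR { seen[$0] = 1; words[$1] = 1; next }
         ($1 in words) && !($0 in seen)' \
      lex_cmu_nostress.txt lex_msu_mapped.txt | sort -u > pron_mismatches.txt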

On Thu, Jun 2, 2016 at 8:04 AM, vince62s wrote:

@guoguo12 not trying to be a pain :), but looking forward, so asking another question re: lexicon and LM. I saw Dan's comment on mapping the various lexicons [as a matter of fact, even Librispeech and Tedlium, which I think are both derived from CMUdict, do not have the same phonemes, e.g. AY vs. AY1 AY2], some of them being UPPERCASE, others lowercase (same for the LMs). Is there one specific choice on all of this?

@guoguo12
Contributor

guoguo12 commented Jun 3, 2016

I've been using lowercase phones without stress markers. I didn't use the MSU dictionary. It seems that fewer than 2% of the words used in the Switchboard transcripts (counting tokens, not unique words) are absent from CMUdict. Many of these are partial words with hyphens, like "th-" or "tha-".
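
For reference, that kind of OOV token count can be computed with something like the following (the data and dict paths are placeholders):

    # Fraction of transcript tokens (fields 2..NF of the Kaldi text file)
    # that are missing from the lexicon.
    awk 'NR==FNR { lex[$1] = 1; next }
         { for (i = 2; i <= NF; i++) { total++; if (!($i in lex)) oov++ } }
         END { printf "OOV tokens: %d / %d (%.2f%%)\n", oov, total, 100*oov/total }' \
      data/local/dict/lexicon.txt data/swbd_train/text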

@vijayaditya
Contributor Author

vijayaditya commented Sep 16, 2016

addressed in #771

@vince62s
Contributor

vince62s commented Oct 6, 2016

@vijayaditya Did someone actually run the nnet3 and/or chain recipes on this multi-database setup?

@vijayaditya
Contributor Author

Not yet.

Vijay

@migueljette

I'm curious, one year later: has anybody run this through the nnet3 or chain models, @guoguo12 or @sikoried?

@guoguo12
Contributor

Sorry, I haven't.

@migueljette

Ok, no worries. Thanks for the quick reply!

@galv
Contributor

galv commented Sep 17, 2017 via email

@danpovey
Contributor

danpovey commented Sep 17, 2017 via email

@ananthnagaraj

Hi,
Just wanted to check if nnet3 or chain has been run on this recipe.

@xiaohui-zhang
Contributor

xiaohui-zhang commented Mar 24, 2018 via email

@viju2008

viju2008 commented Sep 8, 2019

I want to combine LibriSpeech, TEDLIUM and Common Voice. Can modifying the script help me achieve this?

In fact, it would be good if the creators of this recipe could provide it.

@JiayiFu

JiayiFu commented Jul 14, 2020

Sorry for commenting on this out-of-date issue.
Just wanted to check: are there any updated results for the chain recipe? @viju2008 @xiaohui-zhang Did you do some experiments?
