Removed distributed training support #1988
Conversation
I won’t have time or enough internet access to review this until Monday, but could you make sure you remove all references in READMEs and docs? Maybe nix the docs/ folder altogether? It’s severely out of date at this point.
Overall this looks great, just a couple of comments. Really good to clean some of this code!
@@ -388,9 +384,11 @@ def train(server=None):
     train_set = DataSet(train_data,
                         FLAGS.train_batch_size,
                         limit=FLAGS.limit_train,
-                        next_index=lambda i: coord.get_next_index('train'))
+                        next_index=train_index.inc)
nit: next_index=lambda i: train_index += 1, and then we can avoid defining the SampleIndex class and just make this variable an integer.
This is what I first tried. In Python, x += 1 is a statement, not an expression, so it cannot appear in a lambda, and there is no non-hacky way to put multiple statements into a lambda. If you use an inline function instead, you have to make train_index global, because in Python you cannot rebind a variable from an outer scope without a global (or nonlocal) declaration. You would have to do all of this twice, for train and dev, and it would get more verbose than the existing (cleaner) solution. This is Python...
Ah, that's unfortunate. Alright!
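For reference, a minimal sketch of the pattern discussed above. The SampleIndex name and its inc method come from this thread; the constructor, the signature of inc, and the DataSet usage shown in the trailing comments are assumptions, not the PR's exact code.

```python
# Sketch only: why a bare integer plus a lambda does not work, and what a tiny
# counter class buys us.

# train_index = 0
# next_index = lambda i: train_index += 1   # SyntaxError: `+=` is a statement,
#                                           # not an expression, so it cannot
#                                           # appear inside a lambda.

class SampleIndex:
    """Small mutable counter whose bound method can be passed as a callback."""
    def __init__(self, index=0):
        self.index = index

    def inc(self, old_index=None):
        # The argument mirrors the `lambda i: ...` signature and is ignored
        # here (an assumption about how next_index is called).
        self.index += 1
        return self.index

train_index = SampleIndex()
dev_index = SampleIndex()

# The data sets can then use the bound methods directly, e.g.:
# train_set = DataSet(..., next_index=train_index.inc)
# dev_set   = DataSet(..., next_index=dev_index.inc)
```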
DeepSpeech.py (outdated)
    current_epoch = coord._epoch-1
    # Checkpointing
    epoch_saver = tf.train.Saver(max_to_keep=FLAGS.max_to_keep)
Previously we had checkpoints saved every 10 minutes so that we could recover from a crash without losing too much training time. This now only saves at the end of each epoch. Saving checkpoints at epoch end is nice to have, but I think saving every 10 minutes is a good safety feature to keep.
Will address this...
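One possible way to keep both behaviours, sketched under assumptions: a second tf.train.Saver plus a wall-clock check inside the training loop. Only epoch_saver, tf.train.Saver, and FLAGS.max_to_keep appear in the diff above; session, train_op, checkpoint_path, global_step, and the loop shape are placeholders rather than the PR's actual code.

```python
import time
import tensorflow as tf

# Sketch: epoch-end checkpoints plus a periodic "safety" checkpoint every
# 10 minutes, so a crash loses at most ~10 minutes of training.
epoch_saver = tf.train.Saver(max_to_keep=FLAGS.max_to_keep)
safety_saver = tf.train.Saver(max_to_keep=1)   # keep only the latest safety copy
last_safety_save = time.time()

for epoch in range(num_epochs):                # placeholder loop structure
    for _ in range(batches_per_epoch):
        session.run(train_op)                  # placeholder training step
        if time.time() - last_safety_save > 600:
            safety_saver.save(session, checkpoint_path, global_step=global_step)
            last_safety_save = time.time()
    # End of epoch: keep the numbered epoch checkpoint as in the diff above.
    epoch_saver.save(session, checkpoint_path, global_step=global_step)
```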
    test()

    if FLAGS.export_dir:
        export()
This change is just lovely :D
Test failures were just version checks due to missing tags in your fork, merging.
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
This PR also adds support for automatically checkpointing the epoch model with the best dev loss so far.
@reuben The main activity is in the training loop and I tried to keep the changes there, apart from trivial removal of stuff (coordinator, config/FLAGS, etc.). In particular, the feeder is kept as it was. So rebasing your big data PR is hard for the training part but should still be possible.
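For reference, the best-dev-loss checkpointing mentioned in the description typically looks something like the sketch below; all names here (run_train_epoch, run_dev_epoch, best_dev_path, etc.) are assumptions rather than code from this PR.

```python
import tensorflow as tf

# Sketch: keep a separate checkpoint for the epoch with the lowest dev loss
# seen so far, alongside the regular epoch checkpoints.
best_dev_saver = tf.train.Saver(max_to_keep=1)
best_dev_loss = float('inf')

for epoch in range(num_epochs):                    # placeholder loop
    run_train_epoch(session, train_set)            # placeholder training epoch
    dev_loss = run_dev_epoch(session, dev_set)     # placeholder dev evaluation
    if dev_loss < best_dev_loss:
        best_dev_loss = dev_loss
        best_dev_saver.save(session, best_dev_path, global_step=global_step)
```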