Formalize progress tracking inside of the trainer internals #6429
Comments
That sounds interesting! In your scenario: Would the trainer keep track of the different metrics and then deposit the data in the model, so that the data survives the lifecycle of the trainer?
I really like this. This could also serve as part of the interface for custom loops to interact with the trainer state. cc: @justusschock. But I wouldn't do it via a callback, I'd have it as a property.
This can be a property calculated from the others, but what do you sum? Just training and validation? Also test?
We can have them, just because why not, even though they shouldn't impact the trainer state
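For illustration, a derived total could look something like the sketch below; the class and field names are hypothetical, not an agreed-upon API.

```python
from dataclasses import dataclass, field


@dataclass
class StageCounters:
    """Counters for a single stage (hypothetical names, for illustration only)."""
    batches: int = 0


@dataclass
class Progress:
    train: StageCounters = field(default_factory=StageCounters)
    val: StageCounters = field(default_factory=StageCounters)
    test: StageCounters = field(default_factory=StageCounters)

    @property
    def total_batches(self) -> int:
        # Derived from the per-stage counters rather than stored, so the
        # "what do you sum" question is answered in exactly one place.
        return self.train.batches + self.val.batches + self.test.batches
```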
The progress bar could be generated from this data and the current metrics available
To me, it needs to be out of the loops and out of the callbacks, and into the trainer state. The loops would modify it and the callbacks would read it.
This is a great observation. We could also put a tracker in the model, so that with just a model checkpoint you can know exactly how many batches it has seen. This model tracker would outlive the trainer, and the trainer tracker would add to the model state. For example, loading a model trained for 1000 epochs and then training for 10 more epochs would mean that the model tracker reports 1010 epochs while the trainer tracker reports 10.
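To make the lifecycle point concrete, here is a minimal sketch assuming both the model and the trainer carry an epoch counter and the trainer adds into the model's; the class names are made up for illustration and are not the Lightning API.

```python
from dataclasses import dataclass


@dataclass
class EpochProgress:
    epochs_completed: int = 0


class ToyModel:
    """Stand-in for a LightningModule carrying a lifetime tracker (illustrative only)."""
    def __init__(self) -> None:
        self.progress = EpochProgress()


class ToyTrainer:
    """Stand-in trainer whose per-run tracker also adds into the model's lifetime tracker."""
    def __init__(self) -> None:
        self.progress = EpochProgress()

    def fit(self, model: ToyModel, max_epochs: int) -> None:
        for _ in range(max_epochs):
            self.progress.epochs_completed += 1
            model.progress.epochs_completed += 1


model = ToyModel()
model.progress.epochs_completed = 1000  # e.g. restored from a checkpoint

trainer = ToyTrainer()
trainer.fit(model, max_epochs=10)

assert trainer.progress.epochs_completed == 10   # this run only
assert model.progress.epochs_completed == 1010   # lifetime count outlives the trainer
```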
Yes totally @carmocca I should've refined this further.
Will this integrate with the supported loggers? What I would like to be able to do is pause training mid-epoch and then resume training at that point using a new instance of Trainer and the same logger as before, by passing resume_from_checkpoint=checkpoint_path. Is that possible?
@willleeney No, this won't happen. This is just tracking the trainer states. For resuming to work correctly, it has to be implemented by each logger individually (which is often not trivial, since they handle it in very different ways). Also, this is not dumped to the checkpoint.
@justusschock no worries, as long as it creates the mid-epoch resumable state for training, it'll be good enough for me :)
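For reference, the workflow described in the question would look roughly like the sketch below, using the resume_from_checkpoint Trainer argument as it existed around the time of this discussion (newer releases pass ckpt_path to Trainer.fit instead). The tiny model exists only to make the snippet self-contained.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class TinyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(4, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)

    def train_dataloader(self):
        data = TensorDataset(torch.randn(64, 4), torch.randn(64, 1))
        return DataLoader(data, batch_size=8)


# First run: train briefly and save a checkpoint.
trainer = pl.Trainer(max_epochs=1)
trainer.fit(TinyModel())
trainer.save_checkpoint("partial.ckpt")

# Later run: a fresh Trainer instance resumes from that checkpoint. Without the
# progress-tracking state from this issue being serialized, only epoch/global-step
# level information is restored, not true mid-epoch position or logger state.
resumed = pl.Trainer(max_epochs=2, resume_from_checkpoint="partial.ckpt")
resumed.fit(TinyModel())
```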
@carmocca If there is anything I can help with, I would be happy to chime in since that is a blocker for my work anyway. |
@wittenator I currently have a quick workaround that solves my issue with the logger until this is done, but I still have an issue related to this. It would be really useful if the trainer state could switch between multiple data loaders on request, executing a custom number of training batch steps, so that the data loaders don't both iterate at the same time. Sorry, I don't know if this is the right place to ask or whether this is already implemented, but what I need is to keep track of how far through each data loader I am, in order to iterate through a chosen number of batches.
@wittenator @willleeney We plan to resolve all those issues, but it will be done in several steps so the best way you can help right now is to keep track of the PRs and review them 😄 |
🚀 Feature
We should better enforce progress tracking across the following dimensions (a possible shape is sketched after the list):
Stage: training, evaluation (validation and test), and prediction loops
Granularity: batches, steps, epochs
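One possible shape for such a tracker, purely as a sketch (the names below are not a committed API), is a per-stage record of each granularity:

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class LoopProgress:
    """Counters for one stage, at each granularity (illustrative, not the final API)."""
    batches_processed: int = 0
    steps_completed: int = 0
    epochs_completed: int = 0


def new_progress_state() -> Dict[str, LoopProgress]:
    # One tracker per stage: training, evaluation (validation and test), prediction.
    return {stage: LoopProgress() for stage in ("train", "val", "test", "predict")}


progress = new_progress_state()
progress["train"].batches_processed += 1
progress["train"].steps_completed += 1
```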
Motivation
Pitch
See the full example here
The train loop extends this to track optimizer steps
Trainer maintains its own tracker that has references to the individual loop progress trackers.
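A rough sketch of how that could fit together, with all names illustrative rather than the actual implementation: the train loop's tracker extends a shared base with optimizer-step counters, and the trainer keeps references to each loop's tracker.

```python
from dataclasses import dataclass, field


@dataclass
class LoopProgress:
    """Base counters shared by every loop."""
    batches_processed: int = 0
    epochs_completed: int = 0


@dataclass
class TrainLoopProgress(LoopProgress):
    """The train loop extends the base tracker with optimizer-step counters."""
    optimizer_steps: int = 0


@dataclass
class TrainerProgress:
    """Trainer-level tracker holding references to the individual loop trackers."""
    train: TrainLoopProgress = field(default_factory=TrainLoopProgress)
    val: LoopProgress = field(default_factory=LoopProgress)
    test: LoopProgress = field(default_factory=LoopProgress)
    predict: LoopProgress = field(default_factory=LoopProgress)


# The loops mutate their own tracker; callbacks and the progress bar only read it.
progress = TrainerProgress()
progress.train.batches_processed += 1
progress.train.optimizer_steps += 1
```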