Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[AIR] <Part 2> Support metric logging and checkpointing for LightningTrainer #33183
[AIR] <Part 2> Support metric logging and checkpointing for LightningTrainer #33183
Changes from 39 commits
4c8edc1
e0fd5d9
ebea347
01f9461
22725b6
b3ab873
6ba0577
a9d8ab6
dde861d
55b7916
a75fec6
e1b9d81
2e6500f
895ab43
f76f48d
c1448db
66ab606
2842b14
38cd101
5210cb7
300ba71
1935473
55a7b32
6598894
171a5e8
247ba48
71b24fa
6eea366
8fe5b05
cc8d3e8
e606682
5cc5a42
7bac4e7
b009e0f
ff89639
40cc5dd
9403bf5
a88e57a
c57c310
55918b9
d4be393
9bb4136
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is metrics reporting part of the checkpoint class?
what if I want to report data / iteration, but don't want to create checkpoints?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The context here is, checkpointing and logging are separate logics in Lightning. Checkpoint class can access metrics and checkpoint, but Logger can only access metrics. In order to report checkpoint and metrics together, we implement reporting in checkpoint class.
For logging, we recommend the users keep using lightning's native Loggers(e.g. wandb, mlflow, tensorboard loggers). They can control the logging frequency by themselves and retrieve logs as usual, which is less intrusive and aligns better with user habits.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't feel particularly safe about this. why do we have this keyword, and it's not even __ prefixed ...
also we may be logging this warning msg every time this is called.
is there a list of such keywords defined somewhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. I tried but didn't relevant keys in
ray.air.constants
. I removed this warning msg and changed the key name to__report_on
now.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we actually need this. Let's just
return
early in the cases where no checkpoint is found. Better not to mess with this private method.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that override this method is not elegant, but it seems to be the only way. The name of the last checkpoint is always "last.ckpt", so we can't determine if there's a new checkpoint based on the file name alone.
To give you more context, the logic of
ModelCheckpoint.on_train_batch_end()
is like:Since lightning did not modularize the code block into a self.should_checkpoint() function, we can only track whether
_save_top_k_checkpoint()
and_save_last_checkpoint()
are being called.