Add rank param to Checkpoint #2633

Conversation

sadra-barikbin (Collaborator):

Fixes #2623

Description:

Check list:

  • New tests are added (if a new feature is added)
  • New doc strings: description and/or example code are in RST format
  • Documentation is updated (if required)

github-actions bot added the module: handlers (Core Handlers module) label on Jul 27, 2022.
vfdev-5 (Collaborator) left a comment:

Thanks for the PR @sadra-barikbin!
I haven't checked everything and will continue the review later.

ignite/handlers/checkpoint.py (outdated review thread, resolved)
Comment on lines 825 to 826
```python
# all tpu procs should enter here as internally performs sync across device
self._save_func(checkpoint, path, xm.save)
```
Collaborator:

For XLA, all procs should enter `xm.save`, but you've now added `if self.save_on_rank == idist.get_rank():` at the top, so only one proc will enter this function. Am I missing something?

sadra-barikbin (Collaborator, Author):

I corrected it.
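
(For context, a minimal sketch of the pattern this thread is about. It is illustrative only, not the PR's actual code; `save_checkpoint` and `save_on_rank` are placeholder names here. The key point is that `xm.save` performs a cross-device sync internally, so every XLA process must reach it, while for other backends only the chosen rank should write the file.)

```python
import torch
import ignite.distributed as idist


def save_checkpoint(checkpoint: dict, path: str, save_on_rank: int = 0) -> None:
    # Illustrative sketch only, not the Checkpoint handler's real implementation.
    if idist.backend() == "xla-tpu":
        import torch_xla.core.xla_model as xm

        # All TPU processes must call xm.save: it syncs across devices
        # internally and writes from the master ordinal only.
        xm.save(checkpoint, path)
    elif idist.get_rank() == save_on_rank:
        # Other backends: only the selected rank touches the filesystem,
        # so the rank gate belongs here, not around the XLA branch.
        torch.save(checkpoint, path)
```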

Comment on lines 175 to 176
When running on XLA devices or using :class:`~torch.distributed.optim.ZeroRedundancyOptimizer`, it
should be run in all processes, otherwise application can get stuck on saving the checkpoint.
Collaborator:

We may need to rephrase this sentence...

sadra-barikbin (Collaborator, Author):

What exactly do you mean?
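
(For context on what the quoted docstring describes, a minimal sketch of why the handler must run in every process when a ZeroRedundancyOptimizer is involved. `save_full_checkpoint` is a hypothetical helper, not Ignite API, and an initialized process group is assumed; `consolidate_state_dict()` is a collective, so rank 0 can only save after all ranks have joined the call.)

```python
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.distributed.optim import ZeroRedundancyOptimizer


def save_full_checkpoint(
    model: nn.Module, optimizer: ZeroRedundancyOptimizer, path: str
) -> None:
    # Hypothetical helper: every rank must call it, not just rank 0.
    # consolidate_state_dict() gathers the sharded optimizer state
    # onto the destination rank; skipping it on some ranks deadlocks.
    optimizer.consolidate_state_dict(to=0)
    if dist.get_rank() == 0:
        # Only the destination rank holds the full optimizer state_dict().
        torch.save(
            {"model": model.state_dict(), "optimizer": optimizer.state_dict()},
            path,
        )
```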


```python
import torch
import torch.nn as nn
from torch.distributed.optim import ZeroRedundancyOptimizer
```
Collaborator:

Ignite is supposed to work with PyTorch 1.3.1, where ZeroRedundancyOptimizer is absent, so we have to handle this import differently.
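
(One possible way to guard the import, shown as a sketch only; `HAVE_ZERO` and `_is_zero_optimizer` are made-up names, and the PR may have resolved this differently. The idea is that the module should still import on PyTorch versions that predate ZeroRedundancyOptimizer.)

```python
try:
    from torch.distributed.optim import ZeroRedundancyOptimizer

    HAVE_ZERO = True
except ImportError:
    # Not available on older PyTorch releases such as 1.3.1.
    ZeroRedundancyOptimizer = None
    HAVE_ZERO = False


def _is_zero_optimizer(optimizer) -> bool:
    # Only attempt the isinstance check when the class could be imported.
    return HAVE_ZERO and isinstance(optimizer, ZeroRedundancyOptimizer)
```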

sadra-barikbin deleted the feature-add-rank-param-to-checkpoint-issue-2623 branch on August 13, 2022.
Labels: module: handlers (Core Handlers module)

Successfully merging this pull request may close this issue: Consolidate ZeRO state before checkpoint saving