Add rank param to Checkpoint #2633

Conversation

sadra-barikbin (Collaborator):

Fixes #2623

Description:

Check list:

  • New tests are added (if a new feature is added)
  • New doc strings: description and/or example code are in RST format
  • Documentation is updated (if required)

github-actions bot added the module: handlers (Core Handlers module) label on Jul 27, 2022.
vfdev-5 (Collaborator) left a comment:

Thanks for the PR @sadra-barikbin!
I haven't checked everything and will continue the review later.

ignite/handlers/checkpoint.py (outdated review thread, resolved)
Comment on lines 825 to 826
```python
# all tpu procs should enter here as internally performs sync across device
self._save_func(checkpoint, path, xm.save)
```
Collaborator:

For XLA, all procs should enter `xm.save`, but you've now added `if self.save_on_rank == idist.get_rank():` at the top, so only one proc will enter this function. Am I missing something?

sadra-barikbin (Collaborator, Author):

I corrected it.
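
(For context, a minimal sketch of the pattern this thread is about. It is illustrative only, not the PR's actual code; `save_checkpoint` and `save_on_rank` are placeholder names here. The key point is that `xm.save` performs a cross-device sync internally, so every XLA process must reach it, while for other backends only the chosen rank should write the file.)

```python
import torch
import ignite.distributed as idist


def save_checkpoint(checkpoint: dict, path: str, save_on_rank: int = 0) -> None:
    # Illustrative sketch only, not the Checkpoint handler's real implementation.
    if idist.backend() == "xla-tpu":
        import torch_xla.core.xla_model as xm

        # All TPU processes must call xm.save: it syncs across devices
        # internally and writes from the master ordinal only.
        xm.save(checkpoint, path)
    elif idist.get_rank() == save_on_rank:
        # Other backends: only the selected rank touches the filesystem,
        # so the rank gate belongs here, not around the XLA branch.
        torch.save(checkpoint, path)
```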

Comment on lines 175 to 176
When running on XLA devices or using :class:`~torch.distributed.optim.ZeroRedundancyOptimizer`, it
should be run in all processes, otherwise application can get stuck on saving the checkpoint.
Collaborator:

We may need to rephrase this sentence...

sadra-barikbin (Collaborator, Author):

What exactly do you mean?
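
(For context on what the quoted docstring describes, a minimal sketch of why the handler must run in every process when a ZeroRedundancyOptimizer is involved. `save_full_checkpoint` is a hypothetical helper, not Ignite API, and an initialized process group is assumed; `consolidate_state_dict()` is a collective, so rank 0 can only save after all ranks have joined the call.)

```python
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.distributed.optim import ZeroRedundancyOptimizer


def save_full_checkpoint(
    model: nn.Module, optimizer: ZeroRedundancyOptimizer, path: str
) -> None:
    # Hypothetical helper: every rank must call it, not just rank 0.
    # consolidate_state_dict() gathers the sharded optimizer state
    # onto the destination rank; skipping it on some ranks deadlocks.
    optimizer.consolidate_state_dict(to=0)
    if dist.get_rank() == 0:
        # Only the destination rank holds the full optimizer state_dict().
        torch.save(
            {"model": model.state_dict(), "optimizer": optimizer.state_dict()},
            path,
        )
```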


```python
import torch
import torch.nn as nn
from torch.distributed.optim import ZeroRedundancyOptimizer
```
Collaborator:

Ignite is supposed to work with PyTorch 1.3.1, where ZeroRedundancyOptimizer is absent, so we have to handle this import differently.
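
(One possible way to guard the import, shown as a sketch only; `HAVE_ZERO` and `_is_zero_optimizer` are made-up names, and the PR may have resolved this differently. The idea is that the module should still import on PyTorch versions that predate ZeroRedundancyOptimizer.)

```python
try:
    from torch.distributed.optim import ZeroRedundancyOptimizer

    HAVE_ZERO = True
except ImportError:
    # Not available on older PyTorch releases such as 1.3.1.
    ZeroRedundancyOptimizer = None
    HAVE_ZERO = False


def _is_zero_optimizer(optimizer) -> bool:
    # Only attempt the isinstance check when the class could be imported.
    return HAVE_ZERO and isinstance(optimizer, ZeroRedundancyOptimizer)
```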

sadra-barikbin deleted the feature-add-rank-param-to-checkpoint-issue-2623 branch on August 13, 2022.
Labels: module: handlers (Core Handlers module)

Successfully merging this pull request may close this issue: Consolidate ZeRO state before checkpoint saving