Conversation
Horovod checkpointing is blocked on Lightning-AI/pytorch-lightning#6585. For now, we disable checkpointing for Horovod in the tests.
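For context, disabling checkpointing in the Horovod tests amounts to something like the sketch below. This is only an illustration; the `HorovodRayPlugin` arguments and the use of the `checkpoint_callback` flag are assumptions about the test setup, not the exact code in this PR.

```python
import pytorch_lightning as pl
from ray_lightning import HorovodRayPlugin  # assumed import path/name

def build_horovod_trainer() -> pl.Trainer:
    # Checkpointing stays off for Horovod until
    # Lightning-AI/pytorch-lightning#6585 is resolved.
    return pl.Trainer(
        max_epochs=1,
        checkpoint_callback=False,  # PTL 1.2 flag that disables ModelCheckpoint
        plugins=[HorovodRayPlugin(num_hosts=1, num_slots=2)],  # assumed args
    )
```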
# echo "running ray_ddp_example.py" && python ray_ddp_example.py --smoke-test | ||
# echo "running ray_ddp_example.py with Tune" && python ray_ddp_example.py --smoke-test --tune | ||
# echo "running ray_ddp_tune.py" && python ray_ddp_tune.py --smoke-test | ||
# echo "running ray_horovod_example.py" && python ray_horovod_example.py --smoke-test | ||
# echo "running ray_horovod_example.py with Tune" && python ray_horovod_example.py --smoke-test --tune |
do we plan to not run any of these?
These all use the MNIST dataset from torchvision, which is failing right now due to this error: https://discuss.pytorch.org/t/mnist-server-down/114433. After the next torchvision release we can re-enable these tests (and the ones in the Ray repo).
# echo "running ray_ddp_example.py with Tune" && python ray_ddp_example.py --smoke-test --tune | ||
# echo "running ray_ddp_tune.py" && python ray_ddp_tune.py --smoke-test | ||
# echo "running ray_horovod_example.py" && python ray_horovod_example.py --smoke-test | ||
# echo "running ray_horovod_example.py with Tune" && python ray_horovod_example.py --smoke-test --tune |
do we plan to not run any?
See above comment.
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
self.lightning_module.trainer.accelerator_connector\
    ._training_type_plugin = self
self.lightning_module.trainer.accelerator.training_type_plugin = self
wow that's nasty
yeah this is very hacky... but I had to set this back to self, otherwise when we deserialize the model, the plugin used by the trainer ends up referring to a separate instance of the RayPlugin object.
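To make the issue concrete, here is a rough sketch of what goes wrong and what the reassignment fixes. It is illustrative only: `execute_remote` and the pickle round trip stand in for what Ray does when it ships the module to a worker, and are not the actual ray_lightning code paths.

```python
import pickle

def execute_remote(plugin, lightning_module):
    # Serializing the LightningModule drags along trainer -> accelerator ->
    # training_type_plugin, so after deserialization the trainer points at a
    # *copy* of the RayPlugin, not the instance Ray is actually driving.
    lightning_module = pickle.loads(pickle.dumps(lightning_module))

    # The workaround in the diff above: point the trainer's references back
    # at this plugin instance so both sides share state again.
    trainer = lightning_module.trainer
    trainer.accelerator_connector._training_type_plugin = plugin
    trainer.accelerator.training_type_plugin = plugin
    return lightning_module
```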
would it be reasonable to propose an abstraction to PTL?
yes possibly, though this is specific to the fact that we are serializing the plugin. Let me think about this some more, but I think this should be fine to keep for now.
maybe you can override the serialization codepath (like __reduce__ or something) and have it reference a global variable?
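Roughly like this (a sketch of the __reduce__ idea, not code from this PR; the registry and the plugin internals are made up for illustration, and it only helps when pickling and unpickling happen in the same process):

```python
# Map from an identifier to the live plugin instance in this process.
_PLUGIN_REGISTRY = {}

def _lookup_plugin(plugin_id):
    # Called during unpickling; returns the original instance instead of a copy.
    return _PLUGIN_REGISTRY[plugin_id]

class RayPlugin:
    def __init__(self, num_workers=1):
        self.num_workers = num_workers
        self._plugin_id = id(self)
        _PLUGIN_REGISTRY[self._plugin_id] = self

    def __reduce__(self):
        # Instead of serializing the plugin's state, serialize a reference
        # that resolves back to the same object on unpickling.
        return (_lookup_plugin, (self._plugin_id,))
```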
Though you're right, this wouldn't be a problem in the first place if PTL didn't have this circular reference.
Looks good generally; we should ping them to see if there are any abstractions we can lock in here (or ask them to expose).
Thanks for the review @richardliaw. I think we should merge this PR as is so we can support PTL 1.2 ASAP. We should definitely start an engagement with the PTL team on any architectural changes to the plugin interface that would make the code more maintainable here, but perhaps after getting some more users. I have been pushing small changes as needed directly to PTL.