This repository has been archived by the owner on Nov 3, 2023. It is now read-only.

PTL 1.2 Compatibility #15

Merged: 27 commits merged into ray-project:main on Mar 23, 2021

Conversation

@amogkam (Collaborator) commented Mar 15, 2021

No description provided.

@amogkam amogkam requested a review from richardliaw March 15, 2021 16:10

@amogkam (Collaborator, author) commented Mar 18, 2021

Horovod checkpointing is blocked on Lightning-AI/pytorch-lightning#6585. For now, we disable checkpointing for Horovod in the tests.

Comment on lines +77 to +81
# echo "running ray_ddp_example.py" && python ray_ddp_example.py --smoke-test
# echo "running ray_ddp_example.py with Tune" && python ray_ddp_example.py --smoke-test --tune
# echo "running ray_ddp_tune.py" && python ray_ddp_tune.py --smoke-test
# echo "running ray_horovod_example.py" && python ray_horovod_example.py --smoke-test
# echo "running ray_horovod_example.py with Tune" && python ray_horovod_example.py --smoke-test --tune
Contributor:

do we plan to not run any of these?

Collaborator (author):

These all use the MNIST dataset from torchvision, which is currently failing due to this error: https://discuss.pytorch.org/t/mnist-server-down/114433. Once the next torchvision release is out, we can re-enable these tests (and the ones in the Ray repo).

Comment on lines +138 to +142
# echo "running ray_ddp_example.py with Tune" && python ray_ddp_example.py --smoke-test --tune
# echo "running ray_ddp_tune.py" && python ray_ddp_tune.py --smoke-test
# echo "running ray_horovod_example.py" && python ray_horovod_example.py --smoke-test
# echo "running ray_horovod_example.py with Tune" && python ray_horovod_example.py --smoke-test --tune
Contributor:

do we plan to not run any?

Collaborator (author):

See above comment.

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
Comment on lines +208 to +210
self.lightning_module.trainer.accelerator_connector\
._training_type_plugin = self
self.lightning_module.trainer.accelerator.training_type_plugin = self
Contributor:

wow that's nasty

@amogkam (Collaborator, author), Mar 23, 2021:

Yeah, this is very hacky... but I had to set this back to self; otherwise, when we deserialize the model, the plugin used by the trainer refers to a separate instance of the RayPlugin object.
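(A minimal standalone sketch of the problem being described, with hypothetical stand-in classes rather than the real PTL/RayPlugin types: pickling the module copies its embedded trainer, so after deserialization the trainer's plugin reference points at a new object, and the fix is to point it back at the live plugin instance.)

```python
import pickle

class Plugin:
    """Stand-in for the real RayPlugin."""
    pass

class Trainer:
    """Stand-in for the PTL Trainer, which holds a plugin reference."""
    def __init__(self, plugin):
        self.training_type_plugin = plugin

class Module:
    """Stand-in for the LightningModule, which holds a trainer reference."""
    def __init__(self, trainer):
        self.trainer = trainer

plugin = Plugin()
module = Module(Trainer(plugin))

# Serializing and deserializing the module copies the whole object graph,
# so the deserialized trainer no longer refers to our live plugin:
remote_module = pickle.loads(pickle.dumps(module))
assert remote_module.trainer.training_type_plugin is not plugin

# The workaround in the diff: reassign the trainer's plugin reference
# back to the plugin instance that is actually running.
remote_module.trainer.training_type_plugin = plugin
assert remote_module.trainer.training_type_plugin is plugin
```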

Contributor:

would it be reasonable to propose an abstraction to PTL?

Collaborator (author):

Yes, possibly, though this is specific to the fact that we are serializing the plugin. Let me think about this some more, but I think this is fine to keep for now.

Contributor:

Maybe you can override the serialization codepath (`__reduce__` or something like it) and have it reference a global variable?
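(A minimal sketch of this suggestion, using a hypothetical stand-in class and a module-level global rather than the real RayPlugin: overriding `__reduce__` makes unpickling return the shared global instance instead of constructing a second copy. This assumes the unpickling happens where the global is populated.)

```python
import pickle

_GLOBAL_PLUGIN = None  # assumed module-level slot for the shared instance

def _restore_plugin():
    # Called on unpickle; hands back the one shared instance.
    return _GLOBAL_PLUGIN

class PluginSketch:
    """Hypothetical stand-in for the real RayPlugin."""
    def __init__(self, num_workers):
        global _GLOBAL_PLUGIN
        self.num_workers = num_workers
        _GLOBAL_PLUGIN = self

    def __reduce__(self):
        # Serialize as "call _restore_plugin()" instead of copying state.
        return (_restore_plugin, ())

plugin = PluginSketch(num_workers=2)
clone = pickle.loads(pickle.dumps(plugin))
assert clone is plugin  # same object, no duplicate instance
```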

Collaborator (author):

Though you're right: this wouldn't be a problem in the first place if PTL didn't have this circular reference.

@richardliaw (Contributor) left a comment:

Looks good generally; we should ping them to see if there are any abstractions we can lock in here (or ask them to expose).

@amogkam (Collaborator, author) commented Mar 23, 2021

Thanks for the review @richardliaw. I think we should merge this PR as is so we can support PTL 1.2 ASAP. We should definitely start engaging with the PTL team on architectural changes to the plugin interface that would make the code here more maintainable, but perhaps after we get some more users. I have been pushing small changes directly to PTL as needed.

@amogkam amogkam merged commit 562b63b into ray-project:main Mar 23, 2021