-
Notifications
You must be signed in to change notification settings - Fork 3.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Sagemaker DDP Plugin #6271
Add Sagemaker DDP Plugin #6271
Conversation
Codecov Report
@@ Coverage Diff @@
## master #6271 +/- ##
========================================
- Coverage 93% 86% -7%
========================================
Files 212 216 +4
Lines 13720 14070 +350
========================================
- Hits 12751 12035 -716
- Misses 969 2035 +1066 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks good !
Questions:
- I think you can subclass DDPPlugin directly.
- What about DDPSpawn too ?
- And Should we add an example with Sagemaker API ?
|
||
def __init__(self): | ||
if not _SMDIST_AVAILABLE: | ||
raise MisconfigurationException("`smdistributed` module is not available.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a small description on how to make this work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
raise MisconfigurationException("`smdistributed` module is not available.") | |
raise MisconfigurationException("`smdistributed` package is not available.") |
also add how to instal it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Like it just a few minor comments
|
||
self.sync_batchnorm = sync_batchnorm | ||
self.dist = SMLightningDistributed() | ||
self.num_nodes = len(os.environ['SM_HOSTS']) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this something we maybe should extend the Environment class by?
cc @awaelchli
Hello @kaushikb11! Thanks for updating this PR. There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2021-07-14 12:54:55 UTC |
for more information, see https://pre-commit.ci
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM !
"`smdistributed` module is not available." | ||
" You would need to enable distributed=smdistributed" | ||
" in the Sagemaker Estimator Object." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"`smdistributed` module is not available." | |
" You would need to enable distributed=smdistributed" | |
" in the Sagemaker Estimator Object." | |
"`smdistributed` package is not available." | |
" You would need to enable `distributed=smdistributed` in the Sagemaker Estimator Object." |
if not _SMDIST_AVAILABLE: | ||
raise MisconfigurationException( | ||
"`smdistributed` module is not available." | ||
" You would need to enable distributed=smdistributed" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you mean adding to to DDPSMPlugin
or where?
|
||
def __init__(self): | ||
if not _SMDIST_AVAILABLE: | ||
raise MisconfigurationException("`smdistributed` module is not available.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
raise MisconfigurationException("`smdistributed` module is not available.") | |
raise MisconfigurationException("`smdistributed` package is not available.") |
also add how to instal it
log.info("-" * 100) | ||
log.info(f"distributed_backend={self.distributed_backend}") | ||
log.info(f"All DDP processes registered. Starting ddp with {self.world_size} processes") | ||
log.info("-" * 100) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
log.info("-" * 100) | |
log.info(f"distributed_backend={self.distributed_backend}") | |
log.info(f"All DDP processes registered. Starting ddp with {self.world_size} processes") | |
log.info("-" * 100) | |
log.info( | |
"-" * 100 + '\n', | |
f"distributed_backend={self.distributed_backend}" + '\n', | |
f"All DDP processes registered. Starting ddp with {self.world_size} processes" + '\n', | |
"-" * 100, | |
) |
Converted the PR to draft, because seems like certain functionalities have changed and are breaking from Sagemaker's side. |
|
||
class SMLightningDistributed(LightningDistributed): | ||
|
||
def broadcast(self, obj: Any, group=sm_dist.group.WORLD): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sm_dist
is conditionally imported. So this line fails when it's not present in cases where you train without that package.
Is there an ETA for this feature to be available? |
Are there any plans to pick this up again? It seems that Sagemaker DDP now functions as a backend for |
@lballes Thank you for the heads up! Then we support it automatically just with: from pytorch_lightning.strategies import DDPStrategy
# sagemaker backend: https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel-modify-sdp-pt.html
import smdistributed.dataparallel.torch.torch_smddp
ddp = DDPStrategy(process_group_backend="smddp")
# Configure the strategy on the Trainer
trainer = Trainer(strategy=ddp, accelerator="gpu", devices=8) edit: also, a blogpost! https://aws.amazon.com/blogs/machine-learning/run-pytorch-lightning-and-native-pytorch-ddp-on-amazon-sagemaker-training-featuring-amazon-search/ The sagemaker release has deprecated all the constructs in this PR, so we can close it. |
What does this PR do?
Add Sagemaker DDP Plugin. Reference
Before submitting
PR review
Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:
Did you have fun?
Make sure you had fun coding 🙃