
Using ignite with Megatron-style model-parallel PyTorch modules #1709

Open

g-karthik opened this issue Feb 26, 2021 · 7 comments

Comments

@g-karthik

❓ Questions/Help/Support

This is a somewhat general question, but I'd love a detailed response. When moving beyond standard data-parallel training to hybrid data+model-parallel training (as in Megatron-LM), which ignite abstractions should be used and which should be avoided?

@vfdev-5

@vfdev-5
Collaborator

vfdev-5 commented Feb 26, 2021

@g-karthik thanks for an interesting question! I haven't explored hybrid data+model-parallel training yet and would love to test it.

@sdesrozis any thoughts?
@Nic-Ma have you tried that in MONAI?

@Nic-Ma
Contributor

Nic-Ma commented Feb 26, 2021

Hi @vfdev-5,

MONAI has a model-parallel tutorial: https://github.com/Project-MONAI/research-contributions/tree/master/lamp-automated-model-parallelism
However, I don't think it's based on the ignite workflow.

Thanks.

@sdesrozis
Contributor

I haven't experienced model-parallel training yet. I would be very pleased to explore this topic.

@sdesrozis
Contributor

sdesrozis commented Feb 26, 2021

My first thoughts, if we just consider model parallelism on 2 GPUs:

  • Engine is device-agnostic.
  • x, y and y_pred live on different devices, so you can't use create_supervised_xxx because it moves all data to the same device (see the sketch at the end of this comment).
  • Metrics should be fine because they rely on the output of the update function; if you write your own update function, they should work.
  • auto_model from idist may not work, because DataParallel is used whenever multiple GPUs are detected.
  • I think checkpointing and the loggers should work, but I can't be 100% sure.

We should test this first before trying hybrid data+model parallelism.

@g-karthik could you explain how you plan to distribute your model and data in that case? Thanks in advance.
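
For illustration, here is a minimal sketch of the kind of hand-written training step that could replace create_supervised_trainer when a model is split across two GPUs. The TwoStageModel class, its layer sizes and the device placement are hypothetical; only Engine comes from ignite:

```python
import torch
import torch.nn as nn
from ignite.engine import Engine

class TwoStageModel(nn.Module):
    """Hypothetical two-stage split: first half on cuda:0, second half on cuda:1."""
    def __init__(self):
        super().__init__()
        self.stage0 = nn.Linear(32, 64).to("cuda:0")
        self.stage1 = nn.Linear(64, 10).to("cuda:1")

    def forward(self, x):
        h = torch.relu(self.stage0(x.to("cuda:0")))
        return self.stage1(h.to("cuda:1"))  # output lives on cuda:1

model = TwoStageModel()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()

def train_step(engine, batch):
    # Custom update function: each tensor is moved to the device that consumes it,
    # instead of letting create_supervised_trainer move everything to one device.
    model.train()
    optimizer.zero_grad()
    x, y = batch
    y_pred = model(x)                          # inputs moved inside forward()
    loss = criterion(y_pred, y.to("cuda:1"))   # target goes to the output device
    loss.backward()
    optimizer.step()
    return loss.item()

trainer = Engine(train_step)
```

Metrics attached to this engine only see what train_step returns, so they don't care that intermediate tensors lived on different devices.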

@vfdev-5
Collaborator

vfdev-5 commented Feb 26, 2021

@sdesrozis take a look at https://www.deepspeed.ai/tutorials/pipeline/ and https://www.deepspeed.ai/tutorials/megatron/ and the accompanying example.

In addition to what @sdesrozis said, I think the ignite.distributed module won't be aware of the "topology": it implicitly assumes a data-parallel-only axis. In the worst case, this can lead to hangs while all-reducing metrics...
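
To make that concrete, here is a minimal sketch using raw torch.distributed (not ignite.distributed) of how metric reduction would have to be restricted to the data-parallel axis. The 2-way model-parallel × 2-way data-parallel layout on 4 ranks and the metric_sum tensor are assumptions for illustration:

```python
import torch
import torch.distributed as dist

dist.init_process_group("nccl")
rank, world_size = dist.get_rank(), dist.get_world_size()
torch.cuda.set_device(rank % torch.cuda.device_count())

model_parallel_size = 2
data_parallel_size = world_size // model_parallel_size

# Ranks holding the same model shard form one data-parallel group, e.g. with
# model_parallel_size=2 and world_size=4 the groups are [0, 2] and [1, 3].
data_parallel_group = None
for mp_rank in range(model_parallel_size):
    ranks = list(range(mp_rank, world_size, model_parallel_size))
    group = dist.new_group(ranks)  # every rank must create every group, in the same order
    if rank in ranks:
        data_parallel_group = group

# All-reducing over the default (world) group, which is what a data-parallel-only
# abstraction implicitly does, would mix ranks across the model-parallel axis and can
# hang if not all ranks reach the call. Reducing over the data-parallel group is safe:
metric_sum = torch.tensor([123.0], device="cuda")
dist.all_reduce(metric_sum, group=data_parallel_group)
metric_mean = metric_sum / data_parallel_size
```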

@sdesrozis
Contributor

@vfdev-5 That's exactly what I was thinking regarding the collective ops in metrics.

@vfdev-5
Collaborator

vfdev-5 commented Mar 1, 2021

@g-karthik @sdesrozis I'm working on making ignite.distributed aware of a particular data-parallel configuration. I'll push a draft PR soon with the new API and an example using DeepSpeed.
