
DeepSpeed support for ignite.distributed #2008

Open
Kashu7100 opened this issue May 20, 2021 · 8 comments
@Kashu7100

🚀 Feature

PyTorch Lightning recently added native support for Microsoft DeepSpeed.

I believe it would also be helpful for users if Ignite incorporated the DeepSpeed pipeline for memory-efficient distributed training.

1. For `idist.auto_model`?

To initialize the DeepSpeed engine:

```python
model_engine, optimizer, _, _ = deepspeed.initialize(args=cmd_args,
                                                     model=model,
                                                     model_parameters=params)
```

And for the distributed environment setup, we need to replace `torch.distributed.init_process_group(...)` with `deepspeed.init_distributed()`.

2. checkpoint handler

Checkpointing works slightly differently as well:

```python
model_engine.save_checkpoint(args.save_dir, ckpt_id, client_sd=client_sd)
```
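To see how this could plug into an Ignite-style handler, here is a minimal sketch. Everything below is hypothetical: `DummyEngine` merely stands in for the real DeepSpeed `model_engine` (which exposes `save_checkpoint`/`load_checkpoint`), and the `DeepSpeedCheckpoint` class is an invented illustration, not an existing Ignite API.

```python
# Hypothetical sketch: an Ignite-style handler that delegates checkpointing
# to a DeepSpeed-like engine. DummyEngine stands in for the real
# deepspeed model_engine so the sketch is self-contained.

class DummyEngine:
    def __init__(self):
        self.saved = []

    def save_checkpoint(self, save_dir, ckpt_id, client_sd=None):
        # Record the call instead of writing sharded checkpoint files.
        self.saved.append((save_dir, ckpt_id, client_sd))

    def load_checkpoint(self, load_dir, ckpt_id):
        # The real API returns (checkpoint_path, client_state_dict).
        return f"{load_dir}/{ckpt_id}", self.saved[-1][2]


class DeepSpeedCheckpoint:
    """Hypothetical handler: save a checkpoint every `every` completed epochs."""

    def __init__(self, engine, save_dir, every=1):
        self.engine = engine
        self.save_dir = save_dir
        self.every = every

    def __call__(self, epoch, client_sd):
        if epoch % self.every == 0:
            self.engine.save_checkpoint(self.save_dir,
                                        f"epoch_{epoch}",
                                        client_sd=client_sd)


engine = DummyEngine()
handler = DeepSpeedCheckpoint(engine, "/tmp/ckpts", every=2)
for epoch in range(1, 5):
    handler(epoch, {"step": epoch * 10})
# Epochs 2 and 4 trigger a save.
```

The key design point is that DeepSpeed owns the checkpoint layout (it shards optimizer and model state across ranks), so an Ignite handler would only decide *when* to save, not *what* to serialize.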
@sdesrozis
Contributor

sdesrozis commented May 20, 2021

@Kashu7100 Thank you for this suggestion!

I confirm that it would be very nice to support DeepSpeed with idist. Maybe a new backend could be introduced, what do you think @vfdev-5 and @fco-dv ?

Currently we have docker environment configured with MS DeepSpeed.

https://github.com/pytorch/ignite/tree/master/docker/msdp

Would you like to contribute to this? It seems you already know how to do it 😉

@Kashu7100
Author

@sdesrozis Do you think it is possible to reuse idist.Parallel pipeline without modifications?

```python
with idist.Parallel(backend=backend, **spawn_kwargs) as parallel:
    parallel.run(main, config)
```
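For context, `idist.Parallel` is essentially a context manager that sets up a distributed environment for a given backend and then runs a user function on each process. A toy sketch of that dispatch pattern follows; the backend table and setup strings are invented for illustration (the real implementation lives in `ignite.distributed`), with a hypothetical `"deepspeed"` entry showing where `deepspeed.init_distributed()` could slot in:

```python
# Toy sketch of the idist.Parallel dispatch pattern: pick a setup routine
# by backend name, run user code, then tear down. The setup functions here
# only return the call they would make, so the sketch runs anywhere.

from contextlib import contextmanager

SETUP = {
    "nccl": lambda: "torch.distributed.init_process_group('nccl')",
    "gloo": lambda: "torch.distributed.init_process_group('gloo')",
    "deepspeed": lambda: "deepspeed.init_distributed()",  # hypothetical new backend
}

@contextmanager
def parallel(backend):
    init_call = SETUP[backend]()  # in reality this performs the initialization
    try:
        yield init_call
    finally:
        pass  # real code would destroy the process group here

with parallel("deepspeed") as ctx:
    print(ctx)  # deepspeed.init_distributed()
```

If DeepSpeed falls back to the standard environment-variable initialization, the existing native backend entry could cover it without a new key in the table, which is where the discussion below ends up.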

@sdesrozis
Contributor

sdesrozis commented May 21, 2021

It depends on what you want to do. The feature list of msdp is quite long, and some features would have a deeper impact than others.

For instance, I think that the pipeline parallelism would be a very nice feature to have but not trivial to adapt.

Maybe a first step could be distributed parallelism using the simplified API, as you mentioned. That may mean developing a new backend and integrating it into our idist.Parallel.

You can have a look here. Btw, it's not an easy task, and maybe I'm wrong about what to do. @vfdev-5 was looking further into this; maybe he could help in the discussion.

@sdesrozis
Contributor

sdesrozis commented May 21, 2021

@Kashu7100 Finally, introducing a new backend does not seem to be the right option. Have a look here, and you will see that native PyTorch distributed is used when the distributed environment variables are set.

That is good news for simple use cases.

> @sdesrozis Do you think it is possible to reuse idist.Parallel pipeline without modifications?

I would say yes.
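The observation above rests on DeepSpeed falling back to the standard `torch.distributed` launcher environment variables when they are present. A small sketch of that detection logic (simplified and hypothetical; the real checks live inside DeepSpeed and PyTorch):

```python
# Simplified sketch: the native distributed backend can initialize from
# environment variables alone when the standard torch.distributed launcher
# variables are all set, so no DeepSpeed-specific bootstrap is needed.

import os

def distributed_env_ready(env=os.environ):
    """Return True when RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT are
    all present, i.e. env:// initialization would succeed."""
    required = ("RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT")
    return all(k in env for k in required)

print(distributed_env_ready({"RANK": "0", "WORLD_SIZE": "2",
                             "MASTER_ADDR": "127.0.0.1",
                             "MASTER_PORT": "29500"}))  # True
print(distributed_env_ready({"RANK": "0"}))            # False
```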

@vfdev-5
Collaborator

vfdev-5 commented May 21, 2021

@Kashu7100 thanks for the feature request!

Yes, we plan to improve our support of deepspeed framework which is roughly:

  • cmd line launcher + config file
  • model_engine wrapper
  • various modern optimizers
  • pipeline parallelism
  • amp using nvidia/apex
  • customized distributed (support azure) on top of torch distributed

Our idea was to provide basic integration examples of how to use Ignite and DeepSpeed together. I have looked at it multiple times, and due to a certain overlap between the frameworks it was not obvious where to draw the line.

@sdesrozis I'm not sure whether we should add it as a new backend or not. Let's first create a basic integration example and see which parts of the DeepSpeed code could be simplified using idist.

@sdesrozis
Contributor

> customized distributed (support azure) on top of torch distributed

I think this could be integrated in our native backend, alongside SLURM.

> @sdesrozis I'm not sure whether we should add it as a new backend or not.

IMO it is not necessary.

> Let's first create basic integration example and see which part of DeepSpeed code could be simplified using idist.

That is a good option. As discussed a few weeks ago, the specific engine should be the tricky part; otherwise, the auto helpers could do the job, I suppose.

@vfdev-5 vfdev-5 added this to To do in 0.5.1 via automation May 28, 2021
@saifullah3396

Hi, is there any update on this?

@vfdev-5
Collaborator

vfdev-5 commented Jul 21, 2023

@saifullah3396 Well, this feature is not really a priority right now. If you would like to help with it, we can guide your development from the Ignite side.
