
Provide a backend agnostic Join for LightningLite #14635

Status: Open
awaelchli opened this issue on Sep 10, 2022 · 4 comments
Labels: fabric (lightning.fabric.Fabric), feature (Is an improvement or enhancement)
Milestone: pl:future

awaelchli (Contributor) commented on Sep 10, 2022

🚀 Feature

Provide Join through an intuitive API in LightningLite and make it backend agnostic, i.e., switching from DDP to single-device and vice versa should not require any code changes.

Motivation

The DDP Join context manager in PyTorch lets you run your loops with a different number of items on each rank, without out-of-sync issues or hangs in collective calls. PyTorch calls this "uneven inputs". Normally, the DistributedSampler would "even out" the data on each rank by inserting fake, repeated samples.
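
For reference, a minimal self-contained sketch of the raw PyTorch API being referred to here (two CPU processes over gloo; the batch counts are arbitrary and chosen only to make the inputs uneven):

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.distributed.algorithms.join import Join
from torch.nn.parallel import DistributedDataParallel as DDP


def worker(rank: int) -> None:
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=2)
    model = DDP(torch.nn.Linear(1, 1))
    # Uneven inputs: rank 0 sees 5 batches, rank 1 sees 6.
    inputs = [torch.randn(1) for _ in range(5 + rank)]
    # Ranks that run out of data shadow the collectives of the
    # others instead of hanging in the gradient all-reduce.
    with Join([model]):
        for inp in inputs:
            model(inp).sum().backward()
    dist.destroy_process_group()


if __name__ == "__main__":
    mp.spawn(worker, nprocs=2)
```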

Pitch

Provide Join in LightningLite, more specifically through the Strategy. The idea is that once the user adds the join to their loop, they won't have to change it again when switching to the single-device strategy (where it would simply be a no-op).

Note that by default, Lite auto-inserts a DistributedSampler into the user's dataloader. The tricky part is that Join is only useful if you set drop_last=False in the sampler. How do we link the two features together so that they work in a meaningful way?
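
A rough sketch of what this could look like from the user's perspective. Here `self.join` is the hypothetical piece being proposed; `setup`, `setup_dataloaders`, and `backward` are existing LightningLite methods:

```python
from pytorch_lightning.lite import LightningLite


class Trainer(LightningLite):
    def run(self, model, optimizer, dataloader):
        model, optimizer = self.setup(model, optimizer)
        dataloader = self.setup_dataloaders(dataloader)
        # Hypothetical API: wraps torch's Join under DDP and is a
        # plain no-op context manager under the single-device strategy.
        with self.join(model):
            for batch in dataloader:
                loss = model(batch).sum()
                self.backward(loss)
                optimizer.step()
                optimizer.zero_grad()
```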

Alternatives

Do not introduce this. The user can just use the raw PyTorch APIs.

Additional context

Once this lands in Lite, the PL strategies can also make use of it in their implementations. This can be developed in parallel to #3325.


If you enjoy Lightning, check out our other projects! ⚡

  • Metrics: Machine learning metrics for distributed, scalable PyTorch applications.

  • Lite: enables pure PyTorch users to scale their existing code on any kind of device while retaining full control over their own loops and optimization logic.

  • Flash: The fastest way to get a Lightning baseline! A collection of tasks for fast prototyping, baselining, fine-tuning, and solving problems with deep learning.

  • Bolts: Pretrained SOTA Deep Learning models, callbacks, and more for research and production with PyTorch Lightning and PyTorch.

  • Lightning Transformers: Flexible interface for high-performance research using SOTA Transformers leveraging PyTorch Lightning, Transformers, and Hydra.

cc @Borda @carmocca @justusschock @awaelchli

@awaelchli added the fabric and feature labels and removed the needs triage label on Sep 10, 2022
@awaelchli added this to the pl:future milestone on Sep 10, 2022
justusschock (Member) commented

cc @otaj who was working on Join, I think :)

carmocca (Contributor) commented

Is this proposal relevant to Join only? Or should we instead tackle it from the wider perspective of #7534?

Join would be DDP only, but this idea:

"they won't have to change it again when switching to single-device strategy (it would simply be a no-op)"

should apply to all collective calls.

otaj (Contributor) commented on Sep 13, 2022

Oh, yes, this is a great idea! However, I have to agree with @carmocca that it might be better to have this applied to all collective calls. Maybe even have something like a lightning.lite.distributed module/package which would contain all of these calls.
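
For illustration, a minimal sketch of what such a module could contain. The module name and helpers are hypothetical; the point is only that each collective degrades to an identity/no-op when no process group is initialized:

```python
# Hypothetical lightning.lite.distributed sketch.
import torch
import torch.distributed as dist


def all_reduce(tensor: torch.Tensor) -> torch.Tensor:
    """Sum `tensor` across all ranks; identity on a single device."""
    if dist.is_available() and dist.is_initialized():
        dist.all_reduce(tensor)
    return tensor


def barrier() -> None:
    """Synchronize all ranks; no-op on a single device."""
    if dist.is_available() and dist.is_initialized():
        dist.barrier()
```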

carmocca (Contributor) commented

Related #13821: DeepSpeed did something similar.
