
FSDP Accelerator auto_wrap ? #5614

Closed
vikigenius opened this issue Apr 5, 2022 · 11 comments

vikigenius (Contributor) commented Apr 5, 2022

Currently looking at the discussion here #5433 and the code

    with enable_wrap(wrapper_cls=_FSDP, **self._fsdp_kwargs):
        wrapped_module = wrap(module)

It seems like you have to manually wrap each unit that you want partitioned.

Looking at the FairScale tutorial (https://fairscale.readthedocs.io/en/latest/tutorials/oss.html), there is an auto_wrap function that automatically wraps each submodule for you. This is incredibly convenient if you just want to shard a huge pretrained transformer embedder without wrapping it layer by layer yourself.

Is there a possibility of providing an option to auto_wrap modules?
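For reference, roughly what I have in mind, as a minimal sketch assuming FairScale's enable_wrap/auto_wrap API and an already-initialized distributed process group (the toy module below is just a stand-in for a large pretrained embedder, not AllenNLP's actual accelerator code):

    import torch.nn as nn
    from fairscale.nn import FullyShardedDataParallel as FSDP
    from fairscale.nn.wrap import auto_wrap, enable_wrap

    # Stand-in for a large pretrained transformer embedder.
    module = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024))

    # Assumes torch.distributed has already been initialized.
    with enable_wrap(wrapper_cls=FSDP):
        # auto_wrap recursively wraps submodules that satisfy its size policy,
        # so each one becomes its own FSDP unit, instead of requiring a manual
        # wrap() call around every unit you want partitioned.
        sharded_module = auto_wrap(module)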

epwalsh (Member) commented Apr 6, 2022

We could definitely add support for that. I'm happy to review a PR.

In the meantime you could just wrap the whole module. There might not be much of a performance difference between wrap()-ing it and auto_wrap()-ing it.

vikigenius (Contributor, Author) commented Apr 6, 2022

Sure, I will work on a PR.

I am curious about your statement that there is not much of a difference between wrap and auto_wrap. Can you elaborate?

From my understanding, when you wrap the whole module as a single unit, the all-gather can only happen in one step, and all the parameters needed by the whole module have to be present on each GPU at the same time.

However, if you use auto_wrap, each layer/submodule gets wrapped individually, so only the parameters of the layer currently being computed need to be on the GPU at any given time. This seems like it would be slower but a lot more memory efficient?
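To illustrate what I mean, a rough sketch (made-up layer sizes, and assuming a distributed process group is already set up):

    import torch.nn as nn
    from fairscale.nn import FullyShardedDataParallel as FSDP
    from fairscale.nn.wrap import enable_wrap, wrap

    with enable_wrap(wrapper_cls=FSDP):
        # One big FSDP unit: the full parameter set has to be all-gathered
        # onto every GPU at once during forward/backward.
        whole = wrap(nn.Sequential(*[nn.Linear(1024, 1024) for _ in range(12)]))

        # Per-layer FSDP units (what auto_wrap automates): only the layer
        # currently executing needs its parameters materialized; the rest
        # stay sharded across GPUs.
        per_layer = wrap(nn.Sequential(*[wrap(nn.Linear(1024, 1024)) for _ in range(12)]))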

Am I missing something here, or is my understanding wrong?

epwalsh (Member) commented Apr 6, 2022

> However, if you use auto_wrap, each layer/submodule gets wrapped individually, so only the parameters of the layer currently being computed need to be on the GPU at any given time.

I think you're correct. But the FairScale docs actually say that this will "improve training speed by overlapping the all-gather step across the forward pass." I'm not entirely sure what that means / how sharding individual layers would speed things up. But it does make sense that it would save a lot of memory.

So, ignore my previous comment.

github-actions bot

@epwalsh this is just a friendly ping to make sure you haven't forgotten about this issue 😜

6 similar comments from the github-actions bot followed.

epwalsh removed their assignment Jul 21, 2022

github-actions bot commented Aug 1, 2022

This issue is being closed due to lack of activity. If you think it still needs to be addressed, please comment on this thread 👇

github-actions bot added the stale label Aug 1, 2022
github-actions bot closed this as completed Aug 1, 2022