FSDP Accelerator auto_wrap? #5614
Comments
We could definitely add support for that. I'm happy to review a PR. In the meantime you could just …
Sure, I will work on a PR. I am curious about your statement that there is not much of a difference between wrap and auto-wrap. Can you elaborate? From my understanding, when you wrap a whole module the all-gather can only be overlapped in the final step, and all of the parameters needed by the whole module have to be present on each GPU. However, if you use auto_wrap, each layer/submodule is wrapped separately, and only the parameters for the layer currently executing need to be on the GPU at any given time. This seems like it would be slower but a lot more memory efficient? Am I missing something here, or is my understanding wrong?
I think you're correct. But the FairScale docs actually say that this will "improve training speed by overlapping the all-gather step across the forward pass." I'm not entirely sure what that means or how sharding individual layers would speed things up, but it does make sense that it would save a lot of memory. So, ignore my previous comment.
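For reference, here is a minimal sketch of the two wrapping styles being discussed, based on the enable_wrap/auto_wrap utilities in fairscale.nn.wrap (the exact keyword arguments and the default size threshold for auto-wrapping vary between FairScale versions, and the snippet assumes a distributed process group has already been initialized):

```python
import torch.nn as nn
from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP
from fairscale.nn.wrap import auto_wrap, enable_wrap


def make_model() -> nn.Module:
    # Stand-in for a real model; assumes torch.distributed is already initialized.
    return nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 1024))


# Style 1: wrap the whole module as a single FSDP unit. Its parameters are
# all-gathered together, so each GPU must hold the full unsharded module
# during the forward/backward pass.
wrapped_whole = FSDP(make_model())

# Style 2: auto_wrap puts each sufficiently large submodule in its own FSDP
# unit (the threshold/policy is configurable and version dependent). Only the
# layer currently executing needs its parameters materialized, and the
# all-gather for the next layer can overlap with computation.
with enable_wrap(wrapper_cls=FSDP):
    wrapped_per_layer = FSDP(auto_wrap(make_model()))
```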
@epwalsh this is just a friendly ping to make sure you haven't forgotten about this issue 😜
6 similar comments
This issue is being closed due to lack of activity. If you think it still needs to be addressed, please comment on this thread 👇
Currently looking at the discussion here #5433 and the code in allennlp/allennlp/nn/parallel/fairscale_fsdp_accelerator.py (lines 126 to 127 in 1caf0da), it seems like you have to manually wrap each individual unit of partition.
Looking at the tutorial for FairScale (https://fairscale.readthedocs.io/en/latest/tutorials/oss.html), there is an auto_wrap function that automatically wraps each submodule for you. This would be incredibly convenient if you would just like to wrap a huge pretrained transformer embedder yourself. Is there a possibility of providing an option to auto_wrap modules?
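For illustration, a minimal sketch of the kind of thing being asked for, applied to a large pretrained transformer embedder (the embedder class and FairScale keyword arguments here are just assumptions for the example, and this is not something the accelerator currently does):

```python
from allennlp.modules.token_embedders import PretrainedTransformerEmbedder
from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP
from fairscale.nn.wrap import auto_wrap, enable_wrap

# Assumes torch.distributed has already been initialized by the trainer.
embedder = PretrainedTransformerEmbedder(model_name="bert-base-uncased")

# Recursively wrap each eligible transformer submodule in its own FSDP unit
# instead of manually wrapping (and fully materializing) the whole embedder.
with enable_wrap(wrapper_cls=FSDP):
    sharded_embedder = auto_wrap(embedder)
```

An accelerator-level auto_wrap option could presumably do something similar internally.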