
FSDP Accelerator auto_wrap ? #5614

Closed
vikigenius opened this issue Apr 5, 2022 · 11 comments

vikigenius (Contributor) commented Apr 5, 2022

Currently looking at the discussion here #5433 and the code

    with enable_wrap(wrapper_cls=_FSDP, **self._fsdp_kwargs):
        wrapped_module = wrap(module)

It seems like you have to manually wrap each unit that you want partitioned.

Looking at the FairScale tutorial (https://fairscale.readthedocs.io/en/latest/tutorials/oss.html), there is an auto_wrap function that automatically wraps each submodule for you. This is incredibly convenient if you just want to shard a huge pretrained transformer embedder without wrapping it layer by layer yourself.

Is there a possibility of providing an option to auto_wrap modules?
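For reference, roughly what I have in mind, as a minimal sketch assuming FairScale's enable_wrap/auto_wrap API and an already-initialized distributed process group (the toy module below is just a stand-in for a large pretrained embedder, not AllenNLP's actual accelerator code):

    import torch.nn as nn
    from fairscale.nn import FullyShardedDataParallel as FSDP
    from fairscale.nn.wrap import auto_wrap, enable_wrap

    # Stand-in for a large pretrained transformer embedder.
    module = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024))

    # Assumes torch.distributed has already been initialized.
    with enable_wrap(wrapper_cls=FSDP):
        # auto_wrap recursively wraps submodules that satisfy its size policy,
        # so each one becomes its own FSDP unit, instead of requiring a manual
        # wrap() call around every unit you want partitioned.
        sharded_module = auto_wrap(module)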

epwalsh (Member) commented Apr 6, 2022

We could definitely add support for that. I'm happy to review a PR.

In the meantime you could just wrap the whole module. There might not be much of a performance difference between wrap()-ing it and auto_wrap()-ing it.

vikigenius (Contributor, Author) commented Apr 6, 2022

Sure, I will work on a PR.

I am curious about your statement that there is not much of a difference between wrap and auto_wrap. Can you elaborate?

From my understanding, when you wrap the whole module as a single unit, the all-gather can only happen in one step, and all the parameters needed by the whole module have to be present on each GPU at the same time.

However, if you use auto_wrap, each layer/submodule gets wrapped individually, so only the parameters of the layer currently being computed need to be on the GPU at any given time. This seems like it would be slower but a lot more memory efficient?
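To illustrate what I mean, a rough sketch (made-up layer sizes, and assuming a distributed process group is already set up):

    import torch.nn as nn
    from fairscale.nn import FullyShardedDataParallel as FSDP
    from fairscale.nn.wrap import enable_wrap, wrap

    with enable_wrap(wrapper_cls=FSDP):
        # One big FSDP unit: the full parameter set has to be all-gathered
        # onto every GPU at once during forward/backward.
        whole = wrap(nn.Sequential(*[nn.Linear(1024, 1024) for _ in range(12)]))

        # Per-layer FSDP units (what auto_wrap automates): only the layer
        # currently executing needs its parameters materialized; the rest
        # stay sharded across GPUs.
        per_layer = wrap(nn.Sequential(*[wrap(nn.Linear(1024, 1024)) for _ in range(12)]))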

Am I missing something here, or is my understanding wrong?

epwalsh (Member) commented Apr 6, 2022

> However, if you use auto_wrap, each layer/submodule gets wrapped individually, so only the parameters of the layer currently being computed need to be on the GPU at any given time.

I think you're correct. But the FairScale docs actually say that this will "improve training speed by overlapping the all-gather step across the forward pass." I'm not entirely sure what that means / how sharding individual layers would speed things up. But it does make sense that it would save a lot of memory.

So, ignore my previous comment.

github-actions bot

@epwalsh this is just a friendly ping to make sure you haven't forgotten about this issue 😜

6 similar comments from the github-actions bot followed.

epwalsh removed their assignment Jul 21, 2022

github-actions bot commented Aug 1, 2022

This issue is being closed due to lack of activity. If you think it still needs to be addressed, please comment on this thread 👇

github-actions bot added the stale label Aug 1, 2022
github-actions bot closed this as completed Aug 1, 2022