[DRAFT] [feat] Experimental CPU offloading #279
Conversation
    MEMORY = enum.auto()

def _split(modules: nn.Sequential, number_shards: int, strategy: SplitStrategy) -> List[List[nn.Module]]:
@min-xu-ai This decides how to shard the model; by default it counts the cumulative parameter size per shard and equalizes that across shards.
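As an illustration of that strategy, here is a minimal sketch of size-balanced splitting; the function and variable names (split_by_param_size, budget) are illustrative placeholders, not the actual _split internals:

```python
from typing import List

import torch.nn as nn

def split_by_param_size(modules: nn.Sequential, number_shards: int) -> List[List[nn.Module]]:
    # Per-shard budget derived from the total parameter count.
    total_params = sum(p.numel() for m in modules for p in m.parameters())
    budget = total_params / number_shards

    shards: List[List[nn.Module]] = [[]]
    current_size = 0
    for module in modules:
        module_size = sum(p.numel() for p in module.parameters())
        # Open a new shard once the running size would exceed the budget,
        # unless we are already on the last allowed shard.
        if current_size + module_size > budget and len(shards) < number_shards:
            shards.append([])
            current_size = 0
        shards[-1].append(module)
        current_size += module_size
    return shards
```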
class OffloadDataParallelExperimental(nn.Module):
    """Implements distributed data parallel training with optimizer state sharding and model sharding.
@min-xu-ai the gist of it is here; a rough forward-pass sketch follows this list. In this implementation:
- the model is split into sequential shards
- the shards are moved to "offload_device" and loaded onto the GPU on the fly, depending on where the compute is in the FW or BW pass
- the shards are synced on FW (to get the updated shard)
- same as OSS, just implemented differently: each rank owns the update for only one shard of the model
- same as ShardedDDP: all the other gradients can be discarded after the reduce
- right now the motion is across the CPU/GPU boundary, but it could be over the network; that does not really change the principle
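To make the shard motion concrete, here is a minimal forward-only sketch, assuming a CPU offload device and a single CUDA compute device; names like naive_offloaded_forward and shards are placeholders, and the real branch routes this through an autograd function (ShardSyncLayer) so the BW pass can reverse the motion:

```python
import torch
import torch.nn as nn

# Hypothetical placeholders for this sketch, not actual module attributes.
offload_device = torch.device("cpu")
compute_device = torch.device("cuda")

def naive_offloaded_forward(shards, inputs):
    # `shards` is a list of nn.Sequential slices of the original model,
    # all initially resident on the offload device.
    x = inputs.to(compute_device)
    for shard in shards:
        shard.to(compute_device)   # pull the shard up just before it is needed
        x = shard(x)
        shard.to(offload_device)   # push it back to keep peak GPU memory low
    return x
```

This blocking version is only a baseline: every copy sits on the critical path, which is exactly what the prefetching discussed further down avoids.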
        # Slice per slice FW, sync in between
        syncRanks = ShardSyncLayer.apply

        for i, (p2, p1, n1, n2) in enumerate(
This part is a bit tricky: the idea is to pre-load the next-next shard (and vice versa in the BW pass) in parallel with the current compute, instead of having this block the compute wavefront.
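A rough sketch of that prefetch pattern, assuming a dedicated CUDA copy stream and one "copy done" event per shard; this is forward-only, uses hypothetical names, and is not the branch's ShardSyncLayer implementation:

```python
import torch

copy_stream = torch.cuda.Stream()  # dedicated stream for host-to-device copies

def forward_with_prefetch(shards, inputs, compute_device=torch.device("cuda")):
    x = inputs.to(compute_device)
    ready = [torch.cuda.Event() for _ in shards]  # one "copy done" event per shard

    # Eagerly load the first two shards so compute can start right away.
    for j in range(min(2, len(shards))):
        with torch.cuda.stream(copy_stream):
            shards[j].to(compute_device, non_blocking=True)
            ready[j].record(copy_stream)

    for i, shard in enumerate(shards):
        # Kick off the copy of shard i+2 on the side stream while shard i computes.
        # True overlap also requires the CPU-side parameters to live in pinned memory.
        if i + 2 < len(shards):
            with torch.cuda.stream(copy_stream):
                shards[i + 2].to(compute_device, non_blocking=True)
                ready[i + 2].record(copy_stream)

        # Block the compute stream only on this shard's own copy, not on the prefetch.
        torch.cuda.current_stream().wait_event(ready[i])
        x = shard(x)
        shard.to("cpu")  # evict the shard we just used
    return x
```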
Really nice Ben! I really like the wavefront idea, quite an awesome technique to try to parallelize the movement. Overall a very clean API too. Regarding the split strategies num_layers and memory, have you seen a large difference between the two? Just wondering if the additional split-by-layers strategy is needed.
Basically splitting by layers is often imbalanced, as expected :) I just removed it, you're right, simpler is better. Another interesting strategy could be to split by FLOPs; it may not be exactly the same as splitting by memory size, but it would take more work and is not really a priority.
…s 2 --optim_type oss_offload_ddp --batch_size=32 --model vit_large_patch16_224
And if you want to save time, both PyTorch and DeepSpeed have already implemented the splitting/balancing in various ways.
Reopening this when ready :)
Finally got a chance to read the code - this is really neat, @blefaudeux!
The rest looks awesome! Easy to understand and very clean. I like how you inject an onload/offload layer.

You might already be aware of this paper: Training Large Neural Networks with Constant Memory using a New Execution Algorithm https://arxiv.org/abs/2002.05645v5 (L2L). It is a very similar solution which also includes pipeline functionality, but instead of partitions of layers it copies one layer at a time to the GPU, runs a few micro-batches forward, sends the results to the parameter server, repeats with every other layer, then goes backward with this exact one-layer-at-a-time approach. PyTorch implementation: https://github.com/TezRomacH/layer-to-layer-pytorch

I haven't measured it, but L2L might be faster because it can run more data per copy, so in theory it should be more efficient speed-wise.
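For comparison, here is a minimal forward-only sketch of the layer-to-layer idea as described above (one layer resident on the GPU at a time, several micro-batches per copy); it is an illustration under those assumptions, not the paper's or the linked repo's actual code:

```python
import torch
import torch.nn as nn

def l2l_forward(layers: nn.Sequential, batch: torch.Tensor, micro_batches: int = 4):
    device = torch.device("cuda")
    # Activations stay on the CPU between layers; only one layer is resident on the GPU.
    chunks = [c.cpu() for c in batch.chunk(micro_batches)]
    for layer in layers:
        layer.to(device)                    # copy a single layer up
        outs = []
        for chunk in chunks:                # amortize the copy over several micro-batches
            outs.append(layer(chunk.to(device)).cpu())
        chunks = outs
        layer.to("cpu")                     # send the layer back down
    return torch.cat(chunks)
```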
Yep! The L2L paper is very similar to the offload solution in this branch. I looked into the PyTorch implementation, which was great except that we want the backward pass to be overridden in a way that does not require the user to call a different function. We are incorporating the L2L approach (gradient and activation offloading, micro-batches, CMP) into the final solution (there were a few PRs this week in case you are interested). I am running a few tests with support for micro-batches (should help with the overhead of device-to-host copies) and will have some comparison numbers soon. Re questions:
Yay! That's great news - thank you for sharing that info, @anj-s
fairscale/fairscale/nn/misc/offload.py, lines 78 to 79 (at 6bfeaed)
fairscale/fairscale/nn/misc/offload.py, line 139 (at 6bfeaed)
Although there is no restriction on them needing to be Tensors, like the pipeline has. But if you're going to run micro-batches down the road, the "must be a Tensor with batch as the first dimension" restriction might resurface. Practically,
Understood. I wish I could find an easy way to convert a tree-like model into a flat one, which is the case for most (all?) transformer models. We would be your early adopters. I've started looking into PyTorch FX and projects that use FX to hopefully find an automated solution, but I'm new to this domain, so it's a learning process. I'm hoping to find a way to do the conversion at the graph level, leaving the source code non-nn.Sequential. If you have pointers to a project already implementing this, I'm all ears.
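As a starting point for that FX exploration, here is a rough sketch of a graph-level flattening pass; it only works under the strong assumption that the traced graph is a straight chain of call_module nodes, which real transformers usually are not, and the helper name flatten_to_sequential is hypothetical:

```python
import torch.fx
import torch.nn as nn

def flatten_to_sequential(model: nn.Module) -> nn.Sequential:
    traced = torch.fx.symbolic_trace(model)
    layers = []
    for node in traced.graph.nodes:
        if node.op == "call_module":
            layers.append(traced.get_submodule(node.target))
        elif node.op not in ("placeholder", "output"):
            # call_function / call_method / get_attr nodes would break a pure
            # nn.Sequential rewrite, so bail out explicitly rather than silently
            # dropping an operation.
            raise NotImplementedError(f"cannot flatten node {node.op}: {node.target}")
    # NOTE: this also assumes each module consumes only the previous node's output.
    return nn.Sequential(*layers)
```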
While this is obviously operation- and hardware-specific, as a general rule, is it faster to re-calculate or to offload (copy to and from)?
To me it would depend on whether the execution is distributed or not. This branch is handling a very specific use case where we're on a single machine, and in that case I would guess that using the dual execution capabilities of CUDA devices (comms and compute truly simultaneous) should make it possible to mask the comms cost. This branch only makes sense for a very big model, and in that case, once the memory ceiling is removed, the next ceiling is compute: runs will obviously be compute bound, which in turn means that I'm doubtful about adding compute to the mix.
Yes, it'd help to limit the discussion to a single machine in this context. So what you're saying is that comms will be much faster, since they should happen asynchronously and will be pre-fetched just in time for use. Sidenote: I think DeepSpeed also performs some compute on the CPU where it makes sense.
Before submitting
What does this PR do?
What it's not:
What it is:
Follow ups needed:
[ ] make sure that streaming overlaps with compute using dedicated CUDA streams; add profiling to the dummy workload test
[ ] (@anj-s) add a matching optimizer wrapper which handles the streaming for parameter updates
[ ] auto-determine the number of slices needed?
[ ] needs unit tests
[ ] needs documentation
PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.
cc @mrshenli
Did you have fun?
Make sure you had fun coding 🙃