[DRAFT] [feat] Experimental CPU offloading #279

Closed
wants to merge 18 commits

Conversation

blefaudeux
Contributor

@blefaudeux commented Dec 30, 2020

Before submitting

  • Was this discussed/approved via a Github issue? (no need for typos, doc improvements)
  • Did you read the contributor guideline?
  • Did you make sure to update the docs?
  • Did you write any new necessary tests?

What does this PR do?

What it's not:

  • Zero3
  • Complete suite to train a model of any size
  • Activation offloading

What it is:

  • a simple nn.Module wrapper which streams the model to and from the GPU, so that pretty much any model size can run.

Follow ups needed:
[ ] make sure that streaming overlaps with compute, using dedicated CUDA streams. Add profiling to the dummy workload test
[ ] (@anj-s) add a matching optimizer wrapper which handles the streaming for parameter updates
[ ] auto-determine the number of slices needed?
[ ] needs unit tests
[ ] needs documentation

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in Github issues there's a high chance it will not be merged.
cc @mrshenli

Did you have fun?

Make sure you had fun coding 🙃

@facebook-github-bot added the CLA Signed label Dec 30, 2020
MEMORY = enum.auto()


def _split(modules: nn.Sequential, number_shards: int, strategy: SplitStrategy) -> List[List[nn.Module]]:
Contributor Author


@min-xu-ai This decides how to shard the model; by default it counts the cumulative parameter size per shard and equalizes that
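For context, a rough sketch of that strategy (not this PR's exact code; the function name is made up): greedily accumulate modules until a shard holds roughly its share of the total parameter count.

from typing import List

import torch.nn as nn


def split_by_param_size(modules: nn.Sequential, number_shards: int) -> List[List[nn.Module]]:
    # Greedy split: keep appending modules to the current shard until it holds
    # roughly total_params / number_shards parameters, then open a new shard.
    sizes = [sum(p.numel() for p in m.parameters()) for m in modules]
    per_shard = sum(sizes) / number_shards
    shards: List[List[nn.Module]] = [[]]
    current = 0.0
    for module, size in zip(modules, sizes):
        if current >= per_shard and len(shards) < number_shards:
            shards.append([])
            current = 0.0
        shards[-1].append(module)
        current += size
    return shards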



class OffloadDataParallelExperimental(nn.Module):
"""Implements distributed data parallel training with optimizer state sharding and model sharding.
Contributor Author


@min-xu-ai the gist of it is here. In this implementation:

  • the model is split into sequential shards
  • the shards are moved to the "offload_device" and loaded to the GPU on the fly, depending on where the compute is in the FW or BW pass
  • the shards are synced on FW (to get the updated shard)
  • same as OSS (just implemented differently): each rank owns the update for only a shard of the model
  • same as ShardedDDP, all the other gradients can be discarded after the reduce
  • right now the motion is across the CPU/GPU boundary, but it could be over the network; that does not really change the principle (toy sketch below)
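A toy, single-process sketch of the streaming part (illustration only, not this PR's code; it skips the process groups, the gradient handling and the optimizer sharding):

from typing import List

import torch
import torch.nn as nn


def streamed_forward(
    shards: List[nn.Sequential], x: torch.Tensor, compute_device: str = "cuda", offload_device: str = "cpu"
) -> torch.Tensor:
    # Bring each shard onto the compute device just before it is needed,
    # run it, then push it back to the offload device to free memory.
    x = x.to(compute_device)
    for shard in shards:
        shard.to(compute_device)
        x = shard(x)
        shard.to(offload_device)
    return x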

# Slice per slice FW, sync in between
syncRanks = ShardSyncLayer.apply

for i, (p2, p1, n1, n2) in enumerate(
Contributor Author


this part is a bit tricky: the idea is to pre-load the next-next shard (and vice versa in the BW pass) in parallel with the current compute, instead of the load blocking the compute wavefront
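Roughly, the overlap relies on a dedicated copy stream, something along these lines (hypothetical helpers, not the actual implementation; the follow-up item about dedicated CUDA streams above is still open):

import torch

copy_stream = torch.cuda.Stream()


def prefetch(shard: torch.nn.Module, device: str = "cuda") -> None:
    # Launch the host-to-device copy of a future shard on a side stream so it
    # can overlap with the compute running on the default stream.
    # (For true overlap the CPU-side tensors should live in pinned memory.)
    with torch.cuda.stream(copy_stream):
        shard.to(device, non_blocking=True)


def wait_for_prefetch() -> None:
    # Before using a prefetched shard, make the default stream wait until the
    # pending copies on the copy stream have completed.
    torch.cuda.current_stream().wait_stream(copy_stream)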

@SeanNaren

Really nice Ben! I really like the wavefront idea, quite an awesome technique to try to parallelize the movement. Overall very clean API too.

Regarding split strategy num_layers and memory, have you seen a large difference between the two? Just wondering if the additional split by layers strategy is needed.

@blefaudeux
Contributor Author

Regarding split strategy num_layers and memory, have you seen a large difference between the two? Just wondering if the additional split by layers strategy is needed.

basically splitting by layers is often imbalanced, as expected :) I just removed it, you're right, simpler is better. Another interesting strategy could be to split by flops; it may not be exactly the same as splitting by memory size, but it would take more work and it's not really a priority

@stas00
Contributor

stas00 commented Jan 29, 2021

Regarding split strategy num_layers and memory, have you seen a large difference between the two? Just wondering if the additional split by layers strategy is needed.

basically splitting by layers is often imbalanced, as expected :) I just removed it, you're right, simpler is better. Another interesting strategy could be to split by flops; it may not be exactly the same as splitting by memory size, but it would take more work and it's not really a priority

and if you want to save time, both pytorch and deepspeed have already implemented the splitting/balancing in various ways.

@blefaudeux
Contributor Author

reopening this when ready :)

@blefaudeux closed this Feb 5, 2021
@stas00
Contributor

stas00 commented Feb 14, 2021

Finally got a chance to read the code - this is really neat, @blefaudeux!

  • As this one doesn't have the microbatching splicing/restoring of the pipe, why is there a restriction on in/out variables? why not just pass *args, **kwargs?
  • And of course there is that nn.Sequential again.

The rest looks awesome! Easy to understand and very clean. I like how you inject an onload/offload layer.

You might be already aware of this paper: Training Large Neural Networks with Constant Memory using a New Execution Algorithm https://arxiv.org/abs/2002.05645v5 (L2L)

This is a very similar solution which also includes a pipeline functionality - but instead of partitions of layers it copies one layer at a time to a GPU, runs a few micro-batches forward, sends the results to the params server, repeats with every other layer, then goes backward doing this exact one layer at a time approach.

Pytorch implementation: https://github.com/TezRomacH/layer-to-layer-pytorch

I haven't measured but L2L might be faster because they can run more data per copy, so in theory it should be more efficient speed-wise.
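To make the L2L scheme above concrete, a forward-only toy sketch (illustration only, not the linked implementation; it skips the params server, the backward pass and the pipelining):

import torch
import torch.nn as nn


def l2l_forward(layers: nn.ModuleList, batch: torch.Tensor, micro_batches: int = 4) -> torch.Tensor:
    # L2L-style forward: the model stays on the CPU; each layer is copied to the
    # GPU in turn and all micro-batches are pushed through it before moving on.
    chunks = list(batch.chunk(micro_batches))
    for layer in layers:
        layer.to("cuda")                                      # copy this layer's weights over
        chunks = [layer(c.to("cuda")).cpu() for c in chunks]  # run every micro-batch through it
        layer.to("cpu")                                       # free GPU memory for the next layer
    return torch.cat(chunks)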

@anj-s
Contributor

anj-s commented Feb 15, 2021

Finally got a chance to read the code - this is really neat, @blefaudeux!

  • As this one doesn't have the microbatching splicing/restoring of the pipe, why is there a restriction on in/out variables? why not just pass *args, **kwargs?
  • And of course there is that nn.Sequential again.

The rest looks awesome! Easy to understand and very clean. I like how you inject an onload/offload layer.

You might be already aware of this paper: Training Large Neural Networks with Constant Memory using a New Execution Algorithm https://arxiv.org/abs/2002.05645v5 (L2L)

This is a very similar solution which also includes a pipeline functionality - but instead of partitions of layers it copies one layer at a time to a GPU, runs a few micro-batches forward, sends the results to the params server, repeats with every other layer, then goes backward doing this exact one layer at a time approach.

Pytorch implementation: https://github.com/TezRomacH/layer-to-layer-pytorch

I haven't measured but L2L might be faster because they can run more data per copy, so in theory it should be more efficient speed-wise.

Yep! The L2L paper is very similar to the offload solution in this branch. I looked into the PyTorch implementation which was great except for the fact that we want the backward pass to be overridden in a way that does not require the user to call a different function. We are incorporating the L2L approach (gradient and activation offloading, microbatches, CMP) into the final solution (there were a few PRs this week in case you are interested). I am running a few tests with support for microbatches (should help with the overhead of DTH copies) and will have some comparison numbers soon.

Re questions:

  • I am keeping the input/output generic with *args, **kwargs. Can you point me to the PR line where there is a current restriction? It might be removed in the latest revision.
  • I think we are going to start with nn.Sequential models :) We might explore how to implement something like this for all types of models but this is something we are going to start with.

@stas00
Contributor

stas00 commented Feb 15, 2021

Yep! The L2L paper is very similar to the offload solution in this branch. I looked into the PyTorch implementation which was great except for the fact that we want the backward pass to be overridden in a way that does not require the user to call a different function. We are incorporating the L2L approach (gradient and activation offloading, microbatches, CMP) into the final solution (there were a few PRs this week in case you are interested). I am running a few tests with support for microbatches (should help with the overhead of DTH copies) and will have some comparison numbers soon.

Yay! That's great news - thank you for sharing that info, @anj-s

Re questions:

I am keeping the input/output generic with *args, **kwargs. Can you point me to the PR line where there is a current restriction? It might be removed in the latest revision.

def forward(self, *inputs):  # type: ignore
    return self.model_shard(*inputs) if isinstance(inputs, tuple) else self.model_shard(inputs)

return inputs if isinstance(inputs, tuple) else (inputs,)

Although there is no restriction that they need to be Tensors, like the pipeline has.

But if you're going to run micro-batches down the road, the "must be a Tensor and the first dimension is the batch" restriction might re-surface.

Practically, transformers models pass around a lot of kwargs in forward - control flags. It should be possible to pass all kwargs as positional, but there are so many of those it'd be quite error-prone to get the order right.

I think we are going to start with nn.Sequential models :) We might explore how to implement something like this for all types of models but this is something we are going to start with.

Understood.

I wish I could find an easy way to convert a tree-like model into a flat one, which is what most (all?) transformers models are. We would be your early adopters, but the nn.Sequential restriction is a big barrier to adoption.

I've started looking into pytorch FX and projects that use FX, hoping to find an automated solution, but I'm new to this domain, so it's a learning process. I'm hoping to find a way to do the conversion at the graph level, leaving the source code non-nn.Sequential. If you have pointers to a project that perhaps already implements this, I'm all ears.
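For reference, the graph-level view I mean is roughly this kind of thing: tracing a toy (non-transformers) model with FX and walking its nodes, which is the raw material one could then try to regroup into a flat sequence:

import torch
import torch.fx as fx
import torch.nn as nn


class Tree(nn.Module):
    # A tiny non-Sequential ("tree-like") model, standing in for a real transformers block.
    def __init__(self) -> None:
        super().__init__()
        self.a = nn.Linear(8, 8)
        self.b = nn.Linear(8, 8)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.b(torch.relu(self.a(x)))


traced = fx.symbolic_trace(Tree())
for node in traced.graph.nodes:
    # Each node is one step of the flattened dataflow graph.
    print(node.op, node.target)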

@stas00
Contributor

stas00 commented Feb 15, 2021

We are incorporating the L2L approach (gradient and activation offloading, microbatches, CMP) into the final solution (There were a few PRs this week in case you are interested).

While this is obviously operation- and hardware-specific, as a general rule is it faster to re-calculate or to offload (copy to and from)?

@blefaudeux
Contributor Author

We are incorporating the L2L approach (gradient and activation offloading, microbatches, CMP) into the final solution (There were a few PRs this week in case you are interested).

While this is obviously operation- and hardware-specific, as a general rule is it faster to re-calculate or to offload (copy to and from)?

To me it would depend on whether the execution is distributed or not. This branch handles a very specific use case where we're on a single machine, and in that case I would guess that using the dual execution capabilities of CUDA devices (comms & compute truly simultaneous) should make it possible to mask the comms cost. This branch only makes sense for a very big model, and in that case, once the memory ceiling is removed, the next ceiling is compute: runs will obviously be compute bound, which in turn means that I'm doubtful about adding compute to the mix.

@stas00
Contributor

stas00 commented Feb 16, 2021

Yes, it'd help to limit the discussion to a single machine in this context.

So what you're saying is that comms will be much faster since they should happen asynchronously and will be pre-fetched just in time for use.

Sidenote: I think DeepSpeed also performs some compute on CPU where it makes sense.
