[DRAFT] [feat] Experimental CPU offloading #279
Conversation
    MEMORY = enum.auto()

def _split(modules: nn.Sequential, number_shards: int, strategy: SplitStrategy) -> List[List[nn.Module]]:
@min-xu-ai This decides how to shard the model; by default it counts the cumulative parameter size per shard and equalizes that across shards.
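As an illustration of that strategy, here is a minimal sketch of size-balanced splitting; the function and variable names (split_by_param_size, budget) are illustrative placeholders, not the actual _split internals:

```python
from typing import List

import torch.nn as nn

def split_by_param_size(modules: nn.Sequential, number_shards: int) -> List[List[nn.Module]]:
    # Per-shard budget derived from the total parameter count.
    total_params = sum(p.numel() for m in modules for p in m.parameters())
    budget = total_params / number_shards

    shards: List[List[nn.Module]] = [[]]
    current_size = 0
    for module in modules:
        module_size = sum(p.numel() for p in module.parameters())
        # Open a new shard once the running size would exceed the budget,
        # unless we are already on the last allowed shard.
        if current_size + module_size > budget and len(shards) < number_shards:
            shards.append([])
            current_size = 0
        shards[-1].append(module)
        current_size += module_size
    return shards
```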
class OffloadDataParallelExperimental(nn.Module):
    """Implements distributed data parallel training with optimizer state sharding and model sharding.
@min-xu-ai the gist of it is here; a rough forward-pass sketch follows this list. In this implementation:
- the model is split into sequential shards
- the shards are moved to "offload_device" and loaded onto the GPU on the fly, depending on where the compute is in the FW or BW pass
- the shards are synced on FW (to get the updated shard)
- same as OSS, just implemented differently: each rank owns the update for only one shard of the model
- same as ShardedDDP: all the other gradients can be discarded after the reduce
- right now the motion is across the CPU/GPU boundary, but it could be over the network; that does not really change the principle
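To make the shard motion concrete, here is a minimal forward-only sketch, assuming a CPU offload device and a single CUDA compute device; names like naive_offloaded_forward and shards are placeholders, and the real branch routes this through an autograd function (ShardSyncLayer) so the BW pass can reverse the motion:

```python
import torch
import torch.nn as nn

# Hypothetical placeholders for this sketch, not actual module attributes.
offload_device = torch.device("cpu")
compute_device = torch.device("cuda")

def naive_offloaded_forward(shards, inputs):
    # `shards` is a list of nn.Sequential slices of the original model,
    # all initially resident on the offload device.
    x = inputs.to(compute_device)
    for shard in shards:
        shard.to(compute_device)   # pull the shard up just before it is needed
        x = shard(x)
        shard.to(offload_device)   # push it back to keep peak GPU memory low
    return x
```

This blocking version is only a baseline: every copy sits on the critical path, which is exactly what the prefetching discussed further down avoids.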
        # Slice per slice FW, sync in between
        syncRanks = ShardSyncLayer.apply

        for i, (p2, p1, n1, n2) in enumerate(
This part is a bit tricky: the idea is to pre-load the next-next shard (and vice versa in the BW pass) in parallel with the current compute, instead of having this block the compute wavefront.
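A rough sketch of that prefetch pattern, assuming a dedicated CUDA copy stream and one "copy done" event per shard; this is forward-only, uses hypothetical names, and is not the branch's ShardSyncLayer implementation:

```python
import torch

copy_stream = torch.cuda.Stream()  # dedicated stream for host-to-device copies

def forward_with_prefetch(shards, inputs, compute_device=torch.device("cuda")):
    x = inputs.to(compute_device)
    ready = [torch.cuda.Event() for _ in shards]  # one "copy done" event per shard

    # Eagerly load the first two shards so compute can start right away.
    for j in range(min(2, len(shards))):
        with torch.cuda.stream(copy_stream):
            shards[j].to(compute_device, non_blocking=True)
            ready[j].record(copy_stream)

    for i, shard in enumerate(shards):
        # Kick off the copy of shard i+2 on the side stream while shard i computes.
        # True overlap also requires the CPU-side parameters to live in pinned memory.
        if i + 2 < len(shards):
            with torch.cuda.stream(copy_stream):
                shards[i + 2].to(compute_device, non_blocking=True)
                ready[i + 2].record(copy_stream)

        # Block the compute stream only on this shard's own copy, not on the prefetch.
        torch.cuda.current_stream().wait_event(ready[i])
        x = shard(x)
        shard.to("cpu")  # evict the shard we just used
    return x
```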
Really nice Ben! I really like the wavefront idea, quite an awesome technique to try to parallelize the movement. Overall a very clean API too. Regarding the split strategies num_layers and memory, have you seen a large difference between the two? Just wondering if the additional split-by-layers strategy is needed.
Basically splitting by layers is often imbalanced, as expected :) I just removed it, you're right, simpler is better. Another interesting strategy could be to split by FLOPs; it may not be exactly the same as splitting by memory size, but it would take more work and is not really a priority.
…s 2 --optim_type oss_offload_ddp --batch_size=32 --model vit_large_patch16_224
And if you want to save time, both PyTorch and DeepSpeed have already implemented the splitting/balancing in various ways.
Reopening this when ready :)
Finally got a chance to read the code - this is really neat, @blefaudeux!
The rest looks awesome! Easy to understand and very clean. I like how you inject an onload/offload layer.

You might already be aware of this paper: Training Large Neural Networks with Constant Memory using a New Execution Algorithm https://arxiv.org/abs/2002.05645v5 (L2L). It is a very similar solution which also includes pipeline functionality, but instead of partitions of layers it copies one layer at a time to the GPU, runs a few micro-batches forward, sends the results to the parameter server, repeats with every other layer, then goes backward with this exact one-layer-at-a-time approach. PyTorch implementation: https://github.com/TezRomacH/layer-to-layer-pytorch

I haven't measured it, but L2L might be faster because it can run more data per copy, so in theory it should be more efficient speed-wise.
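For comparison, here is a minimal forward-only sketch of the layer-to-layer idea as described above (one layer resident on the GPU at a time, several micro-batches per copy); it is an illustration under those assumptions, not the paper's or the linked repo's actual code:

```python
import torch
import torch.nn as nn

def l2l_forward(layers: nn.Sequential, batch: torch.Tensor, micro_batches: int = 4):
    device = torch.device("cuda")
    # Activations stay on the CPU between layers; only one layer is resident on the GPU.
    chunks = [c.cpu() for c in batch.chunk(micro_batches)]
    for layer in layers:
        layer.to(device)                    # copy a single layer up
        outs = []
        for chunk in chunks:                # amortize the copy over several micro-batches
            outs.append(layer(chunk.to(device)).cpu())
        chunks = outs
        layer.to("cpu")                     # send the layer back down
    return torch.cat(chunks)
```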
Yep! The L2L paper is very similar to the offload solution in this branch. I looked into the PyTorch implementation, which was great except that we want the backward pass to be overridden in a way that does not require the user to call a different function. We are incorporating the L2L approach (gradient and activation offloading, micro-batches, CMP) into the final solution (there were a few PRs this week in case you are interested). I am running a few tests with support for micro-batches (should help with the overhead of device-to-host copies) and will have some comparison numbers soon. Re questions:
Yay! That's great news - thank you for sharing that info, @anj-s
fairscale/fairscale/nn/misc/offload.py, lines 78 to 79 (at 6bfeaed)
fairscale/fairscale/nn/misc/offload.py, line 139 (at 6bfeaed)
Although there is no restriction on them needing to be Tensors, like the pipeline has. But if you're going to run micro-batches down the road, the "must be a Tensor with batch as the first dimension" restriction might resurface. Practically,
Understood. I wish I could find an easy way to convert a tree-like model into a flat one, which is the case for most (all?) transformer models. We would be your early adopters. I've started looking into PyTorch FX and projects that use FX to hopefully find an automated solution, but I'm new to this domain, so it's a learning process. I'm hoping to find a way to do the conversion at the graph level, leaving the source code non-nn.Sequential. If you have pointers to a project already implementing this, I'm all ears.
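As a starting point for that FX exploration, here is a rough sketch of a graph-level flattening pass; it only works under the strong assumption that the traced graph is a straight chain of call_module nodes, which real transformers usually are not, and the helper name flatten_to_sequential is hypothetical:

```python
import torch.fx
import torch.nn as nn

def flatten_to_sequential(model: nn.Module) -> nn.Sequential:
    traced = torch.fx.symbolic_trace(model)
    layers = []
    for node in traced.graph.nodes:
        if node.op == "call_module":
            layers.append(traced.get_submodule(node.target))
        elif node.op not in ("placeholder", "output"):
            # call_function / call_method / get_attr nodes would break a pure
            # nn.Sequential rewrite, so bail out explicitly rather than silently
            # dropping an operation.
            raise NotImplementedError(f"cannot flatten node {node.op}: {node.target}")
    # NOTE: this also assumes each module consumes only the previous node's output.
    return nn.Sequential(*layers)
```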
While this is obviously operation- and hardware-specific, as a general rule, is it faster to re-calculate or to offload (copy to and from)?
To me it would depend on whether the execution is distributed or not. This branch is handling a very specific use case where we're on a single machine, and in that case I would guess that using the dual execution capabilities of CUDA devices (comms and compute truly simultaneous) should make it possible to mask the comms cost. This branch only makes sense for a very big model, and in that case, once the memory ceiling is removed, the next ceiling is compute: runs will obviously be compute bound, which in turn means that I'm doubtful about adding compute to the mix.
Yes, it'd help to limit the discussion to a single machine in this context. So what you're saying is that comms will be much faster, since they should happen asynchronously and will be pre-fetched just in time for use. Sidenote: I think DeepSpeed also performs some compute on the CPU where it makes sense.
Before submitting
What does this PR do?
What it's not:
What it is:
Follow ups needed:
[ ] make sure that streaming overlaps with compute using dedicated CUDA streams; add profiling to the dummy workload test
[ ] (@anj-s) add a matching optimizer wrapper which handles the streaming for parameter updates
[ ] auto-determine the number of slices needed?
[ ] needs unit tests
[ ] needs documentation
PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.
cc @mrshenli
Did you have fun?
Make sure you had fun coding 🙃