
[feature] Add support for OffloadModel to enable training large models on 1 GPU. #432

Merged
57 commits merged into master on Feb 26, 2021

Conversation

anj-s
Contributor

@anj-s anj-s commented Feb 24, 2021

Before submitting

  • [ ] Was this discussed/approved via a Github issue? (no need for typos, doc improvements)
  • [x] Did you read the contributor guideline?
  • [ ] Did you make sure to update the docs?
  • [x] Did you write any new necessary tests?

What does this PR do?

Add experimental support for the OffloadModel API, which enables training large models on a single GPU. OffloadModel chunks the given model into a list of modules and copies a given chunk from CPU->GPU during the FW pass. After FW computation the chunk is copied back to the CPU. The process is repeated for the BW pass. The current implementation supports:

  • Specifying number of slices that you want to chunk your model into.
  • Support for activation checkpointing.
  • Support for running multiple microbatches at a time to offset the latency of the repeated param copies from CPU<->GPU.

Caveats:

  • This initial implementation only supports nn.Sequential models.
  • The throughput of the model is lower than when running without OffloadModel. We will continue to work on improving performance and will suggest configurations that enable the highest throughput.
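The chunk-and-copy scheme described above can be sketched in plain PyTorch. This is a toy forward-only illustration of the idea, not the fairscale implementation; the helper names (`split_into_slices`, `naive_offload_forward`) are made up for this sketch, and backward-pass handling is omitted:

```python
import torch
import torch.nn as nn


def split_into_slices(model: nn.Sequential, num_slices: int):
    """Partition the layers of an nn.Sequential into roughly equal slices."""
    layers = list(model)
    per_slice = (len(layers) + num_slices - 1) // num_slices
    return [nn.Sequential(*layers[i:i + per_slice])
            for i in range(0, len(layers), per_slice)]


def naive_offload_forward(model: nn.Sequential, x: torch.Tensor,
                          num_slices: int = 2,
                          compute_device: str = "cpu"):
    """Run the forward pass one slice at a time, keeping the rest on CPU.

    Forward-only sketch: the BW pass (and re-copying shards for it) is not
    handled here.
    """
    for shard in split_into_slices(model, num_slices):
        shard.to(compute_device)         # copy this chunk's params CPU -> compute device
        x = shard(x.to(compute_device))  # FW compute for this slice only
        shard.to("cpu")                  # copy the chunk back, freeing device memory
    return x


model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
out = naive_offload_forward(model, torch.randn(2, 8), num_slices=2)
print(out.shape)  # torch.Size([2, 4])
```

On a real setup `compute_device` would be `"cuda"`; only one slice's parameters then reside in GPU memory at a time, which is what makes models larger than GPU memory trainable at the cost of the extra copies.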

References:

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in Github issues there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

blefaudeux and others added 30 commits December 29, 2020 16:57
…s 2 --optim_type oss_offload_ddp --batch_size=32 --model vit_large_patch16_224
* initial fwd/bwd commit
* checkpoint work
* modify shard loop
* activation offloading and test to start with
* fix lint errors
* update comments
* fix lint
* remove unused var
* remove commented out lines
* modify name
* remove break
* remove profiler comments
* avoid saving inputs
* add support for fp16
* add unit tests
* fix test failure
* cp work, incorrect output dimensions still need to be fixed
* fixed activation outputs
* intermediate cp of work
* add tests
* fix lint errors

Co-authored-by: Anjali Sridhar <anj@devfair0443.h2.fair>
@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Feb 24, 2021
@min-xu-ai
Contributor

quick question on the test file location. Should it be

tests/nn/experimental/test_offload.py

or

tests/experimental/nn/test_offload.py?

I think we mirror the dirs. File names can be shortened, like we have test_fsdp*.py but all in the same mirrored dir. That seems like a good convention?

@min-xu-ai
Contributor

also, see this comment: Lightning-AI/pytorch-lightning#6152 (comment)

@anj-s
Contributor Author

anj-s commented Feb 25, 2021

quick question on the test file location. Should it be

tests/nn/experimental/test_offload.py

or

tests/experimental/nn/test_offload.py?

I think we mirror the dirs. File names can be shortened, like we have test_fsdp*.py but all in the same mirrored dir. That seems like a good convention?

I agree. I want it to be in experimental/ just like I moved tests for ampnet.


```python
def __init__(
    self,
    model_cpu: nn.Sequential,  # hard pre-requisite for now, easier model slicing
```

Contributor
discussing elsewhere, but I think that the FSDP way (wrapping submodules) could apply here, so why not keep both options open (either one monolithic nn.Sequential call, or a per-module wrap)? I think that it adds a lot of flexibility and could be good enough in practice.

Contributor
practically speaking this means that https://pytorch.org/docs/stable/generated/torch.nn.Module.html?highlight=forward%20hook#torch.nn.Module.register_forward_pre_hook can be used, but the latency will be pretty terrible if used "naively" (wait for the FW wavefront to touch, pull in the module), so it's not really a silver bullet
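The naive hook-based approach the comment alludes to could look roughly like this (an illustrative sketch only; `add_offload_hooks` is a made-up helper, and with no prefetching the blocking copy sits on the critical path, which is exactly the latency problem being described):

```python
import torch
import torch.nn as nn


def add_offload_hooks(model: nn.Module, compute_device: str = "cpu"):
    """Naively pull each submodule onto the compute device right before its
    forward runs. No prefetch: the FW wavefront waits for each copy."""
    fired = []  # record hook invocations, for illustration

    def pre_hook(module, inputs):
        module.to(compute_device)  # blocking param copy CPU -> device
        fired.append(type(module).__name__)

    for child in model.children():
        child.register_forward_pre_hook(pre_hook)
    return fired


model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
fired = add_offload_hooks(model, compute_device="cpu")
out = model(torch.randn(3, 4))
print(fired)  # ['Linear', 'ReLU', 'Linear']
```

A less naive version would register the hook on module i to start an asynchronous copy of module i+1, overlapping the transfer with compute, which is roughly what pipelined offload schemes do.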

@blefaudeux
Contributor

Thanks for the great PR @anj-s, it's super comprehensive! I think that we can try to make it more generic over time; it does not have to be perfect right now, and it's a very solid basis I believe. Minor nits if you don't mind, and curious to have @min-xu-ai's eyes on that.

Contributor

@min-xu-ai min-xu-ai left a comment

Sorry for being late to the party. I agree with Ben that this gives us a good start. Lots of interesting things we can potentially do with this.

Review threads on: benchmarks/experimental/offload.py, fairscale/experimental/nn/offload.py
@anj-s anj-s merged commit f7813d6 into master Feb 26, 2021
@anj-s anj-s deleted the offload_experimental branch February 26, 2021 01:09
@ibro45 ibro45 mentioned this pull request Mar 28, 2021