Fine tuning with FSDP v2 #44

tengomucho · 2024-05-28T15:19:45Z

What does this PR do?

After some discussion, we concluded that tuning with the SPMD API directly, as was done in #39, was probably not the best approach, so we moved to FSDP v2. FSDP v2 uses SPMD underneath, so performance is very similar, and it is much simpler to integrate.
Llama and Gemma models have been tested, and there are now two examples: a script and a notebook.
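
To illustrate why the FSDP v2 route is simpler to integrate, here is a minimal sketch of the kind of Trainer arguments involved. It is not the code added in this PR: the function name `sketch_fsdp_v2_args` is hypothetical, and the keys are an assumption based on the `fsdp` / `fsdp_config` options that transformers exposes for XLA, so they should be checked against the installed version.

```python
# Hypothetical sketch, not the helpers added to optimum.tpu in this PR.
# Assumption: the keys below mirror the `fsdp` / `fsdp_config` options
# that transformers accepts for XLA FSDP v2.
def sketch_fsdp_v2_args(layer_cls_name: str) -> dict:
    """Build Trainer-style keyword arguments that enable FSDP v2 on XLA."""
    return {
        # Assumption: dropping the last partial batch keeps shapes static.
        "dataloader_drop_last": True,
        "fsdp": "full_shard",
        "fsdp_config": {
            # Name of the decoder layer class to wrap and shard.
            "transformer_layer_cls_to_wrap": [layer_cls_name],
            "xla": True,
            "xla_fsdp_v2": True,
            "xla_fsdp_grad_ckpt": True,
        },
    }


# The two model families mentioned in this PR.
llama_args = sketch_fsdp_v2_args("LlamaDecoderLayer")
gemma_args = sketch_fsdp_v2_args("GemmaDecoderLayer")
```

A dict like this could then be unpacked into `TrainingArguments` alongside the usual options such as the output directory and batch size.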

HuggingFaceDocBuilderDev · 2024-05-28T15:22:46Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

mfuntowicz

LGTM! Super cool!

tengomucho added 2 commits May 23, 2024 15:34

feat(training): tuning is now done by using FSDP v2

776aa39

Custom SPMD sharding is more complicated and does not offer significant performance advantages over FSDP v2 (which itself uses SPMD). The example is updated and two helpers have been added to optimum.tpu.

chore(training): remove unused specific trainer class

ac7d8e4

SPMD sharding is no longer used in training except through FSDP v2, which is supported by transformers and pytorch 2.3. So the specific trainer class is now deleted.

tengomucho requested a review from mfuntowicz May 28, 2024 15:19

tengomucho added 5 commits May 29, 2024 07:14

chore(llama): revert changes to prepare for spmd

63e1931

No special handling is needed in the modeling code, since there is a global mesh and shard marking is now done in the Trainer class.

feat(gemma): use Linear when in SPMD/FSDP v2

05d7ab7

feat(fsdp v2): training arguments are deduced from model

228877d

This tries to make optimum tpu smart enough to deduce the fine-tuning arguments for supported models.

feat(training): added Gemma training example notebook

c21885c

doc(readme): update the training section

393be20

tengomucho force-pushed the fsdp_v2 branch from ec31f9b to 393be20 Compare May 29, 2024 07:14

mfuntowicz approved these changes May 29, 2024

tengomucho added 2 commits May 29, 2024 13:04

feat(tuning): gemma tuning works well on gemma 2b

8bb637b

For gemma 7b a bigger system might be necessary to avoid OOMs.

fix(train): fsdpv2 training arguments deduced also for PEFT models

862b896

Also, if the model is not matched, a clear error is raised.
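
The behavior described in this commit can be sketched roughly as follows. This is an illustration only, not the code from the PR: the mapping `LAYER_CLS_BY_MODEL_TYPE` and the function `layer_cls_for` are hypothetical names, and the sketch assumes a PEFT wrapper exposes the underlying model through `get_base_model()`. Stand-in objects are used so that the example runs without any model weights.

```python
from types import SimpleNamespace

# Hypothetical mapping covering only the model families named in this PR.
LAYER_CLS_BY_MODEL_TYPE = {
    "llama": "LlamaDecoderLayer",
    "gemma": "GemmaDecoderLayer",
}


def layer_cls_for(model) -> str:
    """Return the decoder layer class name to wrap, or raise a clear error."""
    # Assumption: a PEFT-wrapped model exposes `get_base_model()`.
    get_base = getattr(model, "get_base_model", None)
    if callable(get_base):
        model = get_base()
    model_type = getattr(getattr(model, "config", None), "model_type", None)
    if model_type not in LAYER_CLS_BY_MODEL_TYPE:
        supported = ", ".join(sorted(LAYER_CLS_BY_MODEL_TYPE))
        raise ValueError(
            f"Cannot deduce FSDP v2 arguments for model type {model_type!r}; "
            f"supported types: {supported}"
        )
    return LAYER_CLS_BY_MODEL_TYPE[model_type]


# Stand-ins for a plain model, a PEFT-wrapped model, and an unsupported one.
gemma = SimpleNamespace(config=SimpleNamespace(model_type="gemma"))
peft_wrapped = SimpleNamespace(get_base_model=lambda: gemma)
unsupported = SimpleNamespace(config=SimpleNamespace(model_type="other"))
```

Calling `layer_cls_for(peft_wrapped)` resolves through the wrapper to the same class name as the plain model, while `layer_cls_for(unsupported)` fails early with a message listing the supported types.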

tengomucho force-pushed the fsdp_v2 branch from d3d75ac to 862b896 Compare May 29, 2024 13:11

tengomucho merged commit e154864 into main May 31, 2024
2 checks passed

tengomucho deleted the fsdp_v2 branch May 31, 2024 07:55
