Basic Llama2 Tuning #39
Conversation
This will then be tweaked for TPU/XLA. The original transformers version is 4.40.0, commit 745bbfe.
This should reduce memory consumption with minimal performance loss.
Imported as-is, from version 4.39.3.
For RowParallelLinear and ColumnParallelLinear, use Linear instead of the dedicated classes, to avoid issues with the backward step. A minimal sketch of the substitution follows (the helper name and projection attributes are illustrative, not the PR's actual code):
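```python
import torch.nn as nn

def make_linear(in_features: int, out_features: int, bias: bool = False) -> nn.Linear:
    # Plain nn.Linear in place of RowParallelLinear / ColumnParallelLinear:
    # sharding is handled by the SPMD annotations applied elsewhere, so the
    # backward step goes through the standard autograd path.
    return nn.Linear(in_features, out_features, bias=bias)

# Illustrative usage inside an attention block:
# self.q_proj = make_linear(hidden_size, num_heads * head_dim)
# self.o_proj = make_linear(num_heads * head_dim, hidden_size)
```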
For now it does not seem to work; the issue appears to be that the shapes of the key states are not compatible with the attention calculation after the update. We can investigate the reason and propose a better solution later.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Is there any reason why you chose to use 2D sharding instead of FSDPv2? The latter is integrated into the transformers Trainer: https://huggingface.co/blog/gemma-peft#accelerate-with-fsdp-via-spmd-on-tpu
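For reference, a minimal sketch of the FSDPv2-via-SPMD path described in the linked post, with Llama's decoder layer swapped in for Gemma's; the config keys and values below are assumptions and may differ across transformers versions:

```python
from transformers import Trainer, TrainingArguments

# FSDP-via-SPMD configuration, following the linked blog post; keys are
# assumed to match the transformers version in use.
fsdp_config = {
    "fsdp_transformer_layer_cls_to_wrap": ["LlamaDecoderLayer"],
    "xla": True,
    "xla_fsdp_v2": True,
    "xla_fsdp_grad_ckpt": True,
}

args = TrainingArguments(
    output_dir="./llama-fsdpv2",     # illustrative path
    per_device_train_batch_size=8,   # illustrative value
    fsdp="full_shard",
    fsdp_config=fsdp_config,
)

# trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
# trainer.train()
```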
@alanwaketan yes, the reason is that it did not work out of the box: the trainer was apparently trying to use an XLA API that was experimental and is no longer available, so the code raised an exception. I guess we can fix FSDP in transformers too and compare both to see which one performs better.
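For context, a rough sketch of a 2D (data, model) SPMD mesh with torch_xla; the mesh shape, axis names, and sharding spec below are illustrative and not necessarily what this PR uses:

```python
import numpy as np
import torch_xla.runtime as xr
import torch_xla.distributed.spmd as xs

xr.use_spmd()

num_devices = xr.global_runtime_device_count()
model_axis = 4                                        # illustrative split
mesh_shape = (num_devices // model_axis, model_axis)
mesh = xs.Mesh(np.arange(num_devices), mesh_shape, ("data", "model"))

# Shard a linear layer's weight along the 'model' axis (illustrative):
# xs.mark_sharding(layer.weight, mesh, ("model", None))
```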
Can you point me to the error? All the APIs the trainer uses should be available in 2.3.
FSDPv2's intention is to relieve regular users of the need for all these complicated shardings, unless you know you need 2D sharding instead of 1D.
@alanwaketan Sure, it makes total sense. I will get back to you with the error.
@alanwaketan Ah, by the way, I was still using torch_xla 2.2; we haven't moved to 2.3 yet. I will do that so I can re-test FSDP.
@alanwaketan you can check the errors I see with FSDP here.
This part uses the multihost environment, tested on v5e litepod16.
I raised an issue on the
2.2 is not expected to work. The 2.3 issue seems more generic than FSDPv2.
LGTM! 🔥
What does this PR do?
Add support for Llama model tuning on TPU v5e, with a related example.
Before submitting