
How to use 3090 to train 16k model? #4

Open
aresa7796 opened this issue Jul 1, 2023 · 7 comments
Labels
good first issue Good for newcomers

Comments

@aresa7796

I have 80k supervised examples, but only a 3090 graphics card. How can I use a 3090 to train the 16k model?

@musabgultekin

While it can technically work, it will probably take too much VRAM and be horribly slow.
Check out:
https://huggingface.co/docs/transformers/perf_train_gpu_one
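For reference, here is a minimal sketch of the single-GPU memory savers that guide covers, assuming the training script goes through the Hugging Face Trainer (the values are illustrative, not a recommendation):

```python
# Minimal sketch of single-GPU memory-saving options from the Transformers
# performance guide. Assumes HF Trainer is used; all values are illustrative.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,    # keep the micro-batch tiny
    gradient_accumulation_steps=32,   # recover a usable effective batch size
    gradient_checkpointing=True,      # trade recompute for activation memory
    fp16=True,                        # half-precision activations/gradients
    optim="adafactor",                # smaller optimizer state than AdamW
)
```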

@DachengLi1
Owner

@aresa7796 The current code assumes 8x A100 40GB. I think a 3090 should be able to run it after applying some systems techniques. If we can support training on 3090 (or other non-A100) GPUs, it will be really amazing. We just haven't gotten to it yet; can you try it and share some of your feedback? Here are the steps I think should work:

(1) Use DeepSpeed ZeRO offloading, as shared by @musabgultekin;
(2) Change the monkey patch from flash attention to xFormers by calling this function. xFormers is a memory-efficient attention implementation that supports non-A100 GPUs. I already have the monkey patch implemented. :P
(3) Change bf16 to fp16 in the training command (and delete the tf32 argument as well). A rough sketch of these three changes is below.
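Here is a rough sketch of what the three changes could look like; the monkey-patch import path and function name are assumptions (check this repo's monkey patch code for the real ones), and the DeepSpeed values are illustrative:

```python
# (2) Swap the flash-attention monkey patch for the xFormers one *before* the
#     model is loaded. The import path and function name below are assumptions,
#     not this repo's confirmed API.
from longchat.train.monkey_patch.llama_xformers_monkey_patch import (
    replace_llama_attn_with_xformers_attn,
)
replace_llama_attn_with_xformers_attn()

# (1) + (3) DeepSpeed ZeRO-3 with CPU offloading and fp16 instead of bf16/tf32.
#     Pass this dict (or an equivalent ds_config.json) to the trainer via --deepspeed.
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu"},
        "offload_param": {"device": "cpu"},
    },
    "fp16": {"enabled": True},  # replaces --bf16 True; drop the --tf32 flag
    "gradient_accumulation_steps": "auto",
    "train_micro_batch_size_per_gpu": "auto",
}
```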

Let me know if this works for you!

DachengLi1 added the good first issue (Good for newcomers) label on Jul 1, 2023
@lucasjinreal

I am also wondering about this. For instance, a V100 might not be able to fit a sequence length of 2048 at all. If we use 1024 and apply condensed rotary embeddings with a ratio of 16, will that work? How well?

@DachengLi1
Owner

@lucasjinreal Condensing the rotary embeddings does not reduce memory; it only lets the model keep good quality at 16K.
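For anyone following along, here is a minimal sketch of what "condensing" the rotary embeddings means: positions are divided by a ratio so that long contexts map back into the position range seen during pretraining. The function name is illustrative, not this repo's API:

```python
# Sketch of condensed (interpolated) rotary position embeddings. Positions are
# scaled down by `ratio`, so e.g. token 16384 is rotated like token 1024 was
# during pretraining. This affects long-context quality, not memory use.
import torch

def condensed_rope_angles(seq_len: int, dim: int, ratio: float = 16.0, base: float = 10000.0):
    # Standard RoPE inverse frequencies.
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    # Condense: divide positions by the ratio before computing the angles.
    positions = torch.arange(seq_len).float() / ratio
    angles = torch.outer(positions, inv_freq)  # (seq_len, dim // 2)
    return torch.cos(angles), torch.sin(angles)

# Example: ratio=16 squeezes a 16k context into the original ~1k position range.
cos, sin = condensed_rope_angles(seq_len=16384, dim=128, ratio=16.0)
```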

@lucasjinreal

@DachengLi1 What I mean is that a V100 cannot fit even a modest length like 2048 in most cases.

@DachengLi1
Owner

@lucasjinreal I see, thanks! Condensing would be great; I believe it should work from 1024 to, say, 8192. The thing is that you will still need to fine-tune a bit at the longer length after condensing - but I guess you can resort to an A100 for that adaptation part?

@lucasjinreal

@DachengLi1 Hi, I'd like to discuss a bit more: have you tried comparing your method with ALiBi on extrapolation ability?
