Loading Llama-3-70B-Instruct-AWQ with larger context across two GPUs #2093
daytonturner asked this question in Q&A
Hi all,
I'm currently running an AWQ-quantized Llama-3-70B-Instruct model, and TGI reports that it has reduced `max_total_tokens`, `max_input_tokens`, etc. down to 4096 to allow for multi-user batching. Makes sense. It manages to load this onto one 48 GB RTX 6000, peaking at about 46 GB while the model warms up, then dropping back down to 41.9 GB once it's fully loaded.
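For reference, the single-card launch is along these lines (a sketch rather than my exact command; the volume path, model path, port, and image tag are placeholders):

```bash
# One 48 GB card: TGI reports it has reduced max_total_tokens / max_input_tokens to 4096
docker run --gpus device=0 --shm-size 1g -p 8080:80 \
  -v /path/to/models:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id /data/Llama-3-70B-Instruct-AWQ \
  --quantize awq
```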
Now, I happen to have two of these 48 GB cards in the system, so I've been trying to increase the context size to 8k (or even 6k), and the only way I've found to do this is to specify `--gpus all` and `--num-shard 2` so the model loads across both cards. While this does work, regardless of whether I use 8k or even 5k max total tokens, it loads both cards up and uses roughly the same amount of VRAM on each. Presumably this is due to sharding having some overlap and duplicating what gets loaded onto each card (no idea, just a theory). Either way, both cards end up using ~38-41 GB of VRAM each.
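The two-card attempt looks roughly like this, again with placeholder paths and tag (on older TGI versions the input-length flag is `--max-input-length` rather than `--max-input-tokens`):

```bash
# Both 48 GB cards, sharded across two GPUs, asking for an 8k context
docker run --gpus all --shm-size 1g -p 8080:80 \
  -v /path/to/models:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id /data/Llama-3-70B-Instruct-AWQ \
  --quantize awq \
  --num-shard 2 \
  --max-input-tokens 4096 \
  --max-total-tokens 8192
# Observed: ~38-41 GB of VRAM in use on each card, even with smaller limits like 5k
```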
I'm curious whether I'm doing this correctly. It seems pretty wild that raising the limit even to 5k would consume basically double the VRAM, and also that I can't ask it to use as much of GPU0's VRAM as possible first and then spill over to GPU1, rather than leaving ~8 GB free on each card.
Is there another approach I'm missing here, or is this just how it works?
Thanks in advance!