Llama 3.1 405B with 128k context length #2383
-
Hi guys, I want to host Llama 3.1 405B and am wondering about the hardware requirements and the correct TGI settings. My GPU budget is a whole DGX (8 x H100 with 80 GB each), and I am not sure it is enough. Launching with:

```
text-generation-launcher --model-id=hugging-quants/Meta-Llama-3.1-405B-Instruct-AWQ-INT4 --port=80 --max-best-of=1 --quantize=awq --max-input-tokens=127000 --max-total-tokens=128000
```

results in the following error:

```
RuntimeError: Not enough memory to handle 127050 prefill tokens. You need to decrease `--max-batch-prefill-tokens`
2024-08-09T07:50:03.586051Z ERROR warmup{max_input_length=127000 max_prefill_tokens=127050 max_total_tokens=128000 max_batch_size=None}:warmup: text_generation_client: router/client/src/lib.rs:46: Server error: CANCELLED
```

The error message itself is obvious; I am just wondering whether this is the correct way to do this, because I thought my DGX would be enough. Any thoughts or ideas on this? Thank you very much! Best
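For what it's worth, here is a minimal sketch of the kind of adjustment the error message points toward: keep the AWQ-quantized model, shard it explicitly across all 8 GPUs, and cap `--max-batch-prefill-tokens` below the full 127k input. The flag names are standard `text-generation-launcher` options, but the specific values (and whether your TGI version also requires lowering `--max-input-tokens` to match the prefill cap) are assumptions to experiment with, not verified settings.

```bash
# Sketch only: values are guesses, not verified on an 8 x H100 DGX.
text-generation-launcher \
  --model-id=hugging-quants/Meta-Llama-3.1-405B-Instruct-AWQ-INT4 \
  --port=80 \
  --max-best-of=1 \
  --quantize=awq \
  --num-shard=8 \
  --max-input-tokens=127000 \
  --max-total-tokens=128000 \
  --max-batch-prefill-tokens=8192   # assumed cap; adjust until warmup fits in GPU memory
```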
Replies: 1 comment 1 reply
-
Basically, as I understand it, TGI cannot support the full context length for SOTA models at this point. You can maybe get 40k tokens on a single node even with Llama 3.1 70B, Mistral Large only 64k, etc. See here: #2301
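To illustrate that limit, a launch in line with the ~40k estimate above might look like the sketch below. The exact context length that fits depends on the model, quantization, and TGI version, so treat these numbers as assumed starting points to tune rather than verified settings.

```bash
# Sketch: shrink the context window until warmup succeeds, then grow it back.
# 40000/40500 are illustrative numbers based on the estimate above, not tested values.
text-generation-launcher \
  --model-id=hugging-quants/Meta-Llama-3.1-405B-Instruct-AWQ-INT4 \
  --quantize=awq \
  --num-shard=8 \
  --max-input-tokens=40000 \
  --max-total-tokens=40500 \
  --max-batch-prefill-tokens=40000
```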