Running FP8 PTQ of Llama3-8B on 1x RTX 4090 (24 GB) goes OOM. Is this expected? vLLM FP8 quantization works on the same GPU.
What are the minimum requirements to run this quantization?
I have even tried setting batch size to 1 and it still goes OOM.
Thanks for reporting this. This workload -- examples/nlp/language_modeling/megatron_gpt_ptq.py -- was tested on 1x H100 (80 GB), but I agree that this looks excessive; we'll take a look. For completeness, could you share the full log for your issue?
Please also note that these memory requirements currently apply only to the calibration step (i.e. the script linked above). For the TensorRT-LLM engine itself, memory consumption is as expected: roughly 9 GB for the FP8 Llama3-8B model (you can build engines using tests/export/nemo_export.py, for example).
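For illustration, an engine-build invocation might look something like the sketch below; the flag names are assumptions rather than verified arguments, so check the script's own argument parser for the exact spelling:

```bash
# Hedged sketch: --model_name and --checkpoint_dir are assumed flag names;
# verify them against tests/export/nemo_export.py before running.
python tests/export/nemo_export.py \
    --model_name Llama3-8B-FP8 \
    --checkpoint_dir /path/to/llama3-8b-fp8-qnemo
```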
The export.dtype parameter should match the model precision -- either 16 or bf16 -- to avoid a data cast during the export step, which can also cause OOM.
I also had to use a slightly lower inference.batch_size=16.
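Putting these together, a calibration run could look roughly like the sketch below; export.dtype and inference.batch_size are the overrides discussed above, while model.restore_from_path and quantization.algorithm are assumed config keys used as placeholders:

```bash
# Sketch of an FP8 PTQ calibration run for Llama3-8B.
# export.dtype=bf16 and inference.batch_size=16 come from this thread;
# model.restore_from_path and quantization.algorithm are assumed keys --
# check the script's config for the exact names.
python examples/nlp/language_modeling/megatron_gpt_ptq.py \
    model.restore_from_path=/path/to/llama3-8b.nemo \
    quantization.algorithm=fp8 \
    export.dtype=bf16 \
    inference.batch_size=16
```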
This should do it. Let me know if you have any other issues/questions.