Llama3-8b FP8 PTQ OOM #9981

Closed
JeevanBhoot opened this issue Jul 31, 2024 · 2 comments

Describe the bug

Running FP8 PTQ of Llama3-8b on 1x 4090 (24GB) goes OOM. Is this expected? vLLM FP8 quantization works on the same GPU.
What are the minimum requirements to run this quantization?

I have even tried setting the batch size to 1 and it still goes OOM.
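
Peak usage can be confirmed with a simple watcher along these lines (illustrative sketch only, not part of the repro; it just polls nvidia-smi once a second while the PTQ script runs in another terminal):

    # Illustrative only: poll nvidia-smi every second and track peak GPU memory
    # while megatron_gpt_ptq.py is running in another terminal.
    import subprocess
    import time

    peak_mib = 0
    try:
        while True:
            out = subprocess.check_output(
                ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"]
            )
            used_mib = max(int(x) for x in out.decode().split())
            peak_mib = max(peak_mib, used_mib)
            print(f"current: {used_mib} MiB, peak: {peak_mib} MiB", end="\r")
            time.sleep(1)
    except KeyboardInterrupt:
        print(f"\npeak GPU memory used: {peak_mib} MiB")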

Steps/Code to reproduce bug

python scripts/checkpoint_converters/convert_llama_hf_to_nemo.py --input_name_or_path meta-llama/Meta-Llama-3-8B-Instruct --output_path ./llama3_8b_instruct.nemo --precision bf16

python examples/nlp/language_modeling/megatron_gpt_ptq.py model.restore_from_path=llama3_8b_instruct.nemo quantization.algorithm=fp8 export.decoder_type=llama export.save_path=llama3_8b_instruct_fp8 export.inference_tensor_parallel=1 trainer.num_nodes=1 trainer.devices=1

Environment overview (please complete the following information)

  • Environment location: Docker
  • Method of NeMo install: source
  • If method of install is [Docker], provide docker pull & docker run commands used:
docker run --gpus all -it --rm -v ./NeMo:/NeMo --shm-size=8g -p 8888:8888 -p 6006:6006 --ulimit memlock=-1 --ulimit stack=67108864 --device=/dev/snd nvcr.io/nvidia/pytorch:23.10-py3
@JeevanBhoot JeevanBhoot added the bug Something isn't working label Jul 31, 2024
@elliottnv elliottnv assigned elliottnv and janekl and unassigned elliottnv Jul 31, 2024
janekl (Collaborator) commented Aug 13, 2024

Thanks for reporting this. This workload -- examples/nlp/language_modeling/megatron_gpt_ptq.py -- was tested on a 1xH100 80GB GPU, but I agree that it looks excessive; we'll take a look. For completeness: could you perhaps share the full log for your issue?

Please also note that these memory requirements currently apply only to the calibration step (i.e. the script linked above). For the TensorRT-LLM engine itself the memory consumption is as expected: roughly 9GB for the FP8 Llama3-8B model (you can create engines using tests/export/nemo_export.py, for example).
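
Rough back-of-envelope for that figure (assuming ~8 billion parameters at 1 byte per weight in FP8; the remainder is KV cache and TensorRT-LLM runtime buffers):

    # Rough estimate of FP8 weight memory for an ~8B-parameter model
    # (assumption: 1 byte per parameter in FP8).
    params = 8.03e9          # approximate Llama3-8B parameter count
    bytes_per_param = 1      # FP8 stores one byte per weight
    weights_gb = params * bytes_per_param / 1024**3
    print(f"FP8 weights alone: ~{weights_gb:.1f} GB")  # ~7.5 GB; engine total lands around ~9 GB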

janekl (Collaborator) commented Aug 23, 2024

To successfully calibrate and export the Llama3-8b model on, for example, an L4 24GB GPU you can use:

python megatron_gpt_ptq.py \
    model.restore_from_path=Llama-8b \
    +model.dist_ckpt_load_on_device=False \
    +model.megatron_amp_O2=true \
    +model.precision=bf16 \
    trainer.precision=bf16 \
    quantization.algorithm=fp8 \
    export.dtype=bf16 \
    inference.batch_size=16

Explanations:

  1. The knobs below correctly set up the model for evaluation in bf16. Disabling dist_ckpt_load_on_device avoids memory spikes on model loading:
    +model.dist_ckpt_load_on_device=False \
    +model.megatron_amp_O2=true \
    +model.precision=bf16 \
    trainer.precision=bf16 \
  2. The export.dtype=bf16 parameter should match the model precision -- either 16 or bf16 -- to avoid a data cast on the export step, which may also cause OOM (see the short illustration after this list).
  3. I had to use a slightly lower inference.batch_size=16.
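
As a rough illustration of point 2 in plain PyTorch (not NeMo code): casting a full set of weights to another dtype materializes a second full-size copy, which is where the extra spike during export can come from:

    # Plain PyTorch illustration (not NeMo code): casting to another dtype
    # allocates a second full-size copy that coexists with the original.
    import torch

    n = 500_000_000                                    # 5e8 bf16 elements, ~0.9 GB
    x = torch.empty(n, dtype=torch.bfloat16, device="cuda")
    print(torch.cuda.memory_allocated() / 1024**3)     # ~0.9 GB
    y = x.to(torch.float16)                            # second full-size allocation
    print(torch.cuda.memory_allocated() / 1024**3)     # ~1.9 GB until x is freed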

This should do it. Let me know if you have any other issues/questions.
