Llama3-8b FP8 PTQ OOM #9981

Closed
JeevanBhoot opened this issue Jul 31, 2024 · 2 comments

Describe the bug

Running FP8 PTQ of Llama3-8b on 1x 4090 (24GB) goes OOM. Is this expected? vLLM FP8 quantization works on the same GPU.
What are the minimum requirements to run this quantization?

I have even tried setting the batch size to 1 and it still goes OOM.
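
Peak usage can be confirmed with a simple watcher along these lines (illustrative sketch only, not part of the repro; it just polls nvidia-smi once a second while the PTQ script runs in another terminal):

    # Illustrative only: poll nvidia-smi every second and track peak GPU memory
    # while megatron_gpt_ptq.py is running in another terminal.
    import subprocess
    import time

    peak_mib = 0
    try:
        while True:
            out = subprocess.check_output(
                ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"]
            )
            used_mib = max(int(x) for x in out.decode().split())
            peak_mib = max(peak_mib, used_mib)
            print(f"current: {used_mib} MiB, peak: {peak_mib} MiB", end="\r")
            time.sleep(1)
    except KeyboardInterrupt:
        print(f"\npeak GPU memory used: {peak_mib} MiB")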

Steps/Code to reproduce bug

python scripts/checkpoint_converters/convert_llama_hf_to_nemo.py --input_name_or_path meta-llama/Meta-Llama-3-8B-Instruct --output_path ./llama3_8b_instruct.nemo --precision bf16

python examples/nlp/language_modeling/megatron_gpt_ptq.py model.restore_from_path=llama3_8b_instruct.nemo quantization.algorithm=fp8 export.decoder_type=llama export.save_path=llama3_8b_instruct_fp8 export.inference_tensor_parallel=1 trainer.num_nodes=1 trainer.devices=1

Environment overview (please complete the following information)

  • Environment location: Docker
  • Method of NeMo install: source
  • If method of install is [Docker], provide docker pull & docker run commands used:
docker run --gpus all -it --rm -v ./NeMo:/NeMo --shm-size=8g -p 8888:8888 -p 6006:6006 --ulimit memlock=-1 --ulimit stack=67108864 --device=/dev/snd nvcr.io/nvidia/pytorch:23.10-py3
@JeevanBhoot JeevanBhoot added the bug Something isn't working label Jul 31, 2024
@elliottnv elliottnv assigned elliottnv and janekl and unassigned elliottnv Jul 31, 2024
janekl (Collaborator) commented Aug 13, 2024

Thanks for reporting this. This workload -- examples/nlp/language_modeling/megatron_gpt_ptq.py -- was tested on a 1xH100 80GB GPU, but I agree that it looks excessive; we'll take a look. For completeness: could you perhaps share the full log for your issue?

Please also note that these memory requirements currently apply only to the calibration step (i.e. the script linked above). For the TensorRT-LLM engine itself the memory consumption is as expected: roughly 9GB for the FP8 Llama3-8B model (you can create engines using tests/export/nemo_export.py, for example).
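
Rough back-of-envelope for that figure (assuming ~8 billion parameters at 1 byte per weight in FP8; the remainder is KV cache and TensorRT-LLM runtime buffers):

    # Rough estimate of FP8 weight memory for an ~8B-parameter model
    # (assumption: 1 byte per parameter in FP8).
    params = 8.03e9          # approximate Llama3-8B parameter count
    bytes_per_param = 1      # FP8 stores one byte per weight
    weights_gb = params * bytes_per_param / 1024**3
    print(f"FP8 weights alone: ~{weights_gb:.1f} GB")  # ~7.5 GB; engine total lands around ~9 GB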

janekl (Collaborator) commented Aug 23, 2024

To successfully calibrate and export the Llama3-8b model on, for example, an L4 24GB GPU you can use:

python megatron_gpt_ptq.py \
    model.restore_from_path=Llama-8b \
    +model.dist_ckpt_load_on_device=False \
    +model.megatron_amp_O2=true \
    +model.precision=bf16 \
    trainer.precision=bf16 \
    quantization.algorithm=fp8 \
    export.dtype=bf16 \
    inference.batch_size=16

Explanations:

  1. The knobs below correctly set up the model for evaluation in bf16. Disabling dist_ckpt_load_on_device avoids memory spikes on model loading:
    +model.dist_ckpt_load_on_device=False \
    +model.megatron_amp_O2=true \
    +model.precision=bf16 \
    trainer.precision=bf16 \
  2. The export.dtype=bf16 parameter should match the model precision -- either 16 or bf16 -- to avoid a data cast on the export step, which may also cause OOM (see the short illustration after this list).
  3. I had to use a slightly lower inference.batch_size=16.
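
As a rough illustration of point 2 in plain PyTorch (not NeMo code): casting a full set of weights to another dtype materializes a second full-size copy, which is where the extra spike during export can come from:

    # Plain PyTorch illustration (not NeMo code): casting to another dtype
    # allocates a second full-size copy that coexists with the original.
    import torch

    n = 500_000_000                                    # 5e8 bf16 elements, ~0.9 GB
    x = torch.empty(n, dtype=torch.bfloat16, device="cuda")
    print(torch.cuda.memory_allocated() / 1024**3)     # ~0.9 GB
    y = x.to(torch.float16)                            # second full-size allocation
    print(torch.cuda.memory_allocated() / 1024**3)     # ~1.9 GB until x is freed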

This should do it. Let me know if you have any other issues/questions.
