llm-inference

This repository provides inference utilities, including benchmarking tools, for large language models served or accelerated through Hugging Face, DeepSpeed, FasterTransformer, and others, on common deep learning frameworks and data center GPUs (AMD: MI300, MI250, MI200, MI100; Nvidia: H100, A100, V100).

LLM models supported

The models currently exercised by these utilities include OPT-66B, LLaMA-65B, Falcon-40B-instruct, LLaMA-2-70B-chat, and LLaMA-2-70B (see each script's --model / --name options below).

Prerequisites:

Per-utility prerequisites (local model paths and framework installs) are listed at the top of each section below.

Inference benchmarking utilities

  • ibench_hf.py reports prefill latency and decode (per-token generation) latency for any batch size, prompt (input) length, and generation (output) length provided (a minimal measurement sketch follows the examples below)
# prerequisites:
# To enable faster access and loading, models are expected to be stored locally:
# - OPT-66B tokenizer and parameters at: /data/opt66b
# - LLaMa-65B tokenizer and parameters at: /data/llama65b
# - Falcon-40B-instruct tokenizer and parameters at: /data/falcon40b-instruct
# - LLaMa-2-70B-chat tokenizer and parameters at: /data/llama2-70b-chat
# When adding another large model, follow the same convention.

/dockerx/llm-inference# python ibench_hf.py --help
usage: ibench_hf.py [-h] [--model MODEL] [--platform PLATFORM] [--precision PRECISION] [--n N] [--d] [--nocache] [--debug] [--profiling]

LLM Inference Benchmark Example

optional arguments:
  -h, --help            show this help message and exit
  --model MODEL         name of LLM (opt66b | llama65b | falcon40b-instruct | llama2-70b-chat | llama2-70b) for inference (default: opt66b)
  --platform PLATFORM   name of DL platform (MI300X | 2xH100 | 2xMI250 | 4xMI250) for inference (default: MI300X)
  --precision PRECISION
                        model precision and data type (float16 | bfloat16) for inference (default: float16)
  --n N                 number of iterations to inference; report an average of this number of runs (default: 10)
  --d                   use deterministic prompts like: An increasing sequence: -5 -4 -3 -2 -1 0
  --nocache             Disable KV caching (default: on) for transformer inference
  --debug               Print token generations for debugging (default: off)
  --profiling           Enable DeepSpeed Flops Profiler Profiling (default: off)
/dockerx/llm-inference#

Examples:

python ibench_hf.py --n 1 --debug
python ibench_hf.py --model llama65b
python ibench_hf.py --model opt66b --n 5
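
The following is a minimal sketch of what ibench_hf.py measures, not its actual implementation: prefill latency is approximated by a single-new-token generate() call, and decode latency by the per-token average of the remaining generation. The file name, model path, prompt, batch size, and lengths are illustrative assumptions.

# sketch_hf_latency.py (hypothetical file name) -- minimal prefill/decode timing sketch
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/data/opt66b"            # assumed local model location (see prerequisites above)
batch_size, gen_len = 1, 32            # illustrative sizes

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.float16, device_map="auto"
)

prompt = "An increasing sequence: -5 -4 -3 -2 -1 0"
inputs = tokenizer([prompt] * batch_size, return_tensors="pt").to(model.device)

# Prefill: time to process the prompt and emit the first token.
torch.cuda.synchronize()
t0 = time.perf_counter()
model.generate(**inputs, max_new_tokens=1, do_sample=False)
torch.cuda.synchronize()
prefill_latency = time.perf_counter() - t0

# Prefill plus full generation; the difference divided by the extra tokens
# approximates the per-token decode latency.
torch.cuda.synchronize()
t0 = time.perf_counter()
model.generate(**inputs, max_new_tokens=gen_len, do_sample=False)
torch.cuda.synchronize()
total_latency = time.perf_counter() - t0

decode_latency = (total_latency - prefill_latency) / (gen_len - 1)
print(f"prefill: {prefill_latency * 1e3:.1f} ms, decode: {decode_latency * 1e3:.1f} ms/token")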

Inference with DeepSpeed Acceleration: Tensor Parallelism and Kernel Injection

  • deepspeed/ibench_ds.py reports prefill latency and decode (per-token generation) latency for any batch size, prompt (input) length, and generation (output) length provided, with DeepSpeed acceleration, with or without tensor parallelism, and with or without kernel injection (a minimal sketch follows the examples at the end of this section)
  • the performance benefit from TP is best seen with a very fast inter-GPU interconnect (faster than PCIe), such as AMD Infinity Fabric Link or Nvidia NVLink
  • note: with DeepSpeed 0.10.x, you may need to update OpenAI Triton with pip install --pre -U triton or pip install triton==2.0.0.dev20221120
  • note: with TP, --num_gpus must not exceed the total number of available GPUs
# prerequisites:
# To enable faster access and loading, download and store the converted model weights and tokenizer locally, then pass that path to --name
# For AMD GPUs, install DeepSpeed from https://github.com/ROCmSoftwarePlatform/DeepSpeed -b kernel_injection_UT_enablement

/dockerx/llm-inference# python deepspeed/ibench_ds.py --help
usage: ibench_ds.py [-h] --name NAME [--checkpoint_path CHECKPOINT_PATH] [--save_mp_checkpoint_path SAVE_MP_CHECKPOINT_PATH] [--batch_size BATCH_SIZE] [--dtype {float32,float16,int8}] [--ds_inference] [--use_kernel] [--replace_method REPLACE_METHOD] [--max_tokens MAX_TOKENS]
                    [--prompting_length PROMPTING_LENGTH] [--max_new_tokens MAX_NEW_TOKENS] [--sampling] [--use_meta_tensor] [--performance] [--local_rank LOCAL_RANK] [--world_size WORLD_SIZE] [--debug]

optional arguments:
  -h, --help            show this help message and exit
  --name NAME           model_name
  --checkpoint_path CHECKPOINT_PATH
                        model checkpoint path
  --save_mp_checkpoint_path SAVE_MP_CHECKPOINT_PATH
                        save-path to store the new model checkpoint
  --batch_size BATCH_SIZE
                        batch size
  --dtype {float32,float16,int8}
                        data-type
  --ds_inference        enable ds-inference
  --use_kernel          enable kernel-injection
  --replace_method REPLACE_METHOD
                        replace method['', 'auto']
  --max_tokens MAX_TOKENS
                        maximum tokens used for the text-generation KV-cache
  --prompting_length PROMPTING_LENGTH
                        length of prompts in tokens
  --max_new_tokens MAX_NEW_TOKENS
                        maximum new tokens to generate
  --sampling            sample generation mode
  --use_meta_tensor     use the meta tensors to initialize model
  --performance         enable latency, bandwidth and throughput run
  --local_rank LOCAL_RANK
                        local rank
  --world_size WORLD_SIZE
                        world_size
  --debug               Print token generations for debugging (default: off)
/dockerx/llm-inference#

Examples:

deepspeed --num_gpus 1 deepspeed/ibench_ds.py --name /data/llama2-7b  --batch_size  8 --prompting_length 512 --performance --ds_inference --max_new_tokens  32
deepspeed --num_gpus 1 deepspeed/ibench_ds.py --name /data/llama2-7b  --batch_size 32 --prompting_length 512 --performance --ds_inference --max_new_tokens  64 --use_kernel
deepspeed --num_gpus 4 deepspeed/ibench_ds.py --name /data/llama65b   --batch_size 16 --prompting_length 512 --performance --ds_inference --max_new_tokens  64 --use_kernel
deepspeed --num_gpus 8 deepspeed/ibench_ds.py --name facebook/opt-66b --batch_size 32 --prompting_length 512 --performance --ds_inference --max_new_tokens 256 --use_kernel

To speed up DeepSpeed JIT compilation on AMD GPUs, you may specify the GCN architecture code:
- MI300X: PYTORCH_ROCM_ARCH='gfx940' deepspeed --num_gpus 1 deepspeed/ibench_ds.py --name /data/llama2-7b --batch_size 32 --prompting_length 512 --performance --ds_inference --max_new_tokens 32 --use_kernel
- MI2xx:  PYTORCH_ROCM_ARCH='gfx90a' deepspeed --num_gpus 4 deepspeed/ibench_ds.py --name /data/llama2-7b --batch_size 16 --prompting_length 512 --performance --ds_inference --max_new_tokens 32 --use_kernel
- MI100:  PYTORCH_ROCM_ARCH='gfx908' deepspeed --num_gpus 8 deepspeed/ibench_ds.py --name /data/llama2-7b --batch_size  8 --prompting_length 512 --performance --ds_inference --max_new_tokens 32 --use_kernel
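
The sketch below outlines how DeepSpeed tensor parallelism and kernel injection are typically wired up with deepspeed.init_inference; it is not the actual ibench_ds.py. It assumes the 0.10.x-era API, where mp_size sets the tensor-parallel degree (matching --num_gpus) and replace_with_kernel_inject corresponds to --use_kernel. The file name, model path, and generation length are illustrative assumptions. Launch it with the deepspeed launcher, e.g. deepspeed --num_gpus 4 sketch_ds_inference.py.

# sketch_ds_inference.py (hypothetical file name) -- minimal DeepSpeed inference sketch
import os
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/data/llama2-7b"                    # assumed local checkpoint
world_size = int(os.getenv("WORLD_SIZE", "1"))    # set by the deepspeed launcher
local_rank = int(os.getenv("LOCAL_RANK", "0"))

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16)

# Shard the model across GPUs (tensor parallelism) and optionally replace
# supported modules with DeepSpeed's fused inference kernels.
engine = deepspeed.init_inference(
    model,
    mp_size=world_size,               # tensor-parallel degree (== --num_gpus)
    dtype=torch.float16,
    replace_with_kernel_inject=True,  # the --use_kernel path
)
model = engine.module

inputs = tokenizer("DeepSpeed is", return_tensors="pt").to(f"cuda:{local_rank}")
outputs = model.generate(**inputs, max_new_tokens=32, do_sample=False)
if local_rank == 0:
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))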

Inference with vLLM

  • vllm/ibench_vllm.py reports prefill latency and decode (per-token generation) latency for any batch size, prompt (input) length, and generation (output) length provided, with or without tensor parallelism (a minimal sketch follows the examples below)
  • note: some model configs need to be modified to support 4K and 8K sequence lengths (prompt_len plus output_len), e.g.
  • to support a 4K sequence length for Llama2-13b, set "max_position_embeddings": 4096 in the model's config.json
  • to support an 8K sequence length for Llama2-70b, set "max_position_embeddings": 8192 in the model's config.json
  • make sure prompt_len + output_len < max_position_embeddings
# prerequisites:
# To enable faster access and loading, download and store the converted model weights and tokenizer locally, then pass that path to --model
# install vLLM in your container (AMD or Nvidia GPU) first

Examples:

# on 1 MI300X
python ibench_vllm.py --model /data/llama2/llama2-70B-chat-hf --tensor-parallel-size 1 --input-len 8128 --batch-size 4 --output-len 32

# on 8 H100
python ibench_vllm.py --model /data/llama2/llama2-70B-chat-hf --tensor-parallel-size 8 --input-len 4096 --batch-size 2
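
As a rough outline of the vLLM benchmarking flow (not the actual ibench_vllm.py), the sketch below uses vLLM's offline LLM / SamplingParams API with fixed-length dummy prompts so the prompt length is exact; the file name, model path, tensor-parallel size, and lengths are illustrative assumptions.

# sketch_vllm_latency.py (hypothetical file name) -- minimal vLLM latency/throughput sketch
import time
from vllm import LLM, SamplingParams

model_path = "/data/llama2/llama2-70B-chat-hf"   # assumed local checkpoint
batch_size, input_len, output_len = 4, 1024, 32  # keep input_len + output_len < max_position_embeddings

llm = LLM(model=model_path, tensor_parallel_size=1)
sampling = SamplingParams(temperature=0.0, max_tokens=output_len, ignore_eos=True)

# Fixed-length dummy prompts built from token ids, so the prompt length is exact.
prompt_token_ids = [[0] * input_len for _ in range(batch_size)]

t0 = time.perf_counter()
outputs = llm.generate(prompt_token_ids=prompt_token_ids, sampling_params=sampling)
elapsed = time.perf_counter() - t0

total_new_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"end-to-end: {elapsed:.2f} s, decode throughput: {total_new_tokens / elapsed:.1f} tokens/s")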

Status

We support multiple GPUs, multiple nodes, and multi-dimensional parallelism, some through implicit software setup and some through explicit command-line arguments.

Support and harnesses for the following inference infrastructures are work in progress:

  • FasterTransformer
  • PyTorch FSDP

TODOs:

  • MFU (Model FLOPs Utilization)
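
For reference, MFU is conventionally defined as achieved model FLOPs per second divided by the hardware's peak FLOPs per second, where the forward pass of a decoder-only model costs roughly 2 x parameter-count FLOPs per generated token. The snippet below is a back-of-the-envelope sketch with illustrative numbers, not a measurement from this repository.

# MFU back-of-the-envelope sketch; all numbers below are illustrative assumptions.
num_params = 70e9            # e.g. a 70B-parameter model
tokens_per_second = 1000.0   # measured end-to-end generation throughput
peak_flops = 1.3e15          # peak FP16 FLOPs/s of the target accelerator (hardware-specific)

flops_per_token = 2 * num_params                     # approximate forward-pass FLOPs per token
mfu = (flops_per_token * tokens_per_second) / peak_flops
print(f"MFU: {mfu:.1%}")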

License

MIT. See the LICENSE file.
