Use vLLM as backend

Lighteval allows you to use vLLM as a backend, which can provide significant speedups. To use it, simply change the model_args to reflect the arguments you want to pass to vLLM.

lighteval accelerate \
    --model_args="vllm,pretrained=HuggingFaceH4/zephyr-7b-beta,dtype=float16" \
    --tasks "leaderboard|truthfulqa:mc|0|0" \
    --output_dir="./evals/"

vLLM can distribute the model across multiple GPUs using data parallelism, pipeline parallelism, or tensor parallelism. You can choose the parallelism method by setting it in the model_args.

For example, if you have 4 GPUs, you can split the model across them using tensor parallelism:

export VLLM_WORKER_MULTIPROC_METHOD=spawn && lighteval accelerate \
    --model_args="vllm,pretrained=HuggingFaceH4/zephyr-7b-beta,dtype=float16,tensor_parallel_size=4" \
    --tasks "leaderboard|truthfulqa:mc|0|0" \
    --output_dir="./evals/"

Or, if your model fits on a single GPU, you can use data parallelism to speed up the evaluation:

lighteval accelerate \
    --model_args="vllm,pretrained=HuggingFaceH4/zephyr-7b-beta,dtype=float16,data_parallel_size=4" \
    --tasks "leaderboard|truthfulqa:mc|0|0" \
    --output_dir="./evals/"
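
Pipeline parallelism is also mentioned above. As a minimal sketch only: vLLM exposes this via its pipeline_parallel_size engine argument, but that argument does not appear in the VLLMModelConfig list below, so whether your lighteval version forwards it through model_args is an assumption:

# Assumption: pipeline_parallel_size is forwarded to vLLM by your lighteval version.
# VLLM_WORKER_MULTIPROC_METHOD=spawn is set as in the tensor parallel example above.
export VLLM_WORKER_MULTIPROC_METHOD=spawn && lighteval accelerate \
    --model_args="vllm,pretrained=HuggingFaceH4/zephyr-7b-beta,dtype=float16,pipeline_parallel_size=4" \
    --tasks "leaderboard|truthfulqa:mc|0|0" \
    --output_dir="./evals/"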

Available arguments for the vLLM backend can be found in the VLLMModelConfig:

  • pretrained (str): HuggingFace Hub model ID name or the path to a pre-trained model to load.
  • gpu_memory_utilisation (float): The fraction of GPU memory to use.
  • revision (str): The revision of the model.
  • dtype (str, None): The data type to use for the model.
  • tensor_parallel_size (int): The number of tensor parallel units to use.
  • data_parallel_size (int): The number of data parallel units to use.
  • max_model_length (int): The maximum sequence length (context size) of the model.
  • swap_space (int): The CPU swap space size (GiB) per GPU.
  • seed (int): The seed to use for the model.
  • trust_remote_code (bool): Whether to trust remote code during model loading.
  • add_special_tokens (bool): Whether to add special tokens to the input sequences.
  • multichoice_continuations_start_space (bool): Whether to add a space at the start of each continuation in multichoice generation.
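
For instance, several of these arguments can be combined in a single model_args string. A sketch with arbitrary illustrative values:

# All arguments below are listed in the VLLMModelConfig; the values are illustrative.
lighteval accelerate \
    --model_args="vllm,pretrained=HuggingFaceH4/zephyr-7b-beta,dtype=float16,revision=main,seed=42,trust_remote_code=True" \
    --tasks "leaderboard|truthfulqa:mc|0|0" \
    --output_dir="./evals/"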

Warning

In the case of OOM issues, you might need to reduce the context size of the model (max_model_length) as well as the gpu_memory_utilisation parameter.
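
As a sketch, a more memory-conservative configuration along those lines (the values 2048 and 0.8 are illustrative, not recommendations) might look like:

# Reduce the context size and the fraction of GPU memory vLLM uses.
lighteval accelerate \
    --model_args="vllm,pretrained=HuggingFaceH4/zephyr-7b-beta,dtype=float16,max_model_length=2048,gpu_memory_utilisation=0.8" \
    --tasks "leaderboard|truthfulqa:mc|0|0" \
    --output_dir="./evals/"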