LLM Deployment with TorchServe

This document describes how to easily serve large language models (LLMs) like Meta-Llama3 with TorchServe. Besides a quick start guide using our VLLM integration, we also provide a list of examples which describe other methods to deploy LLMs with TorchServe.

Quickstart LLM Deployment

TorchServe offers easy LLM deployment through its VLLM integration. Using our LLM launcher script, users can deploy any model supported by VLLM with a single command. The launcher can either be used standalone or in combination with our provided TorchServe GPU docker image.

To use the docker image, we first need to build it:

docker build . -f docker/Dockerfile.vllm -t ts/vllm

Models are usually loaded from the HuggingFace hub and are cached in a docker volume for faster reload. If you want to access gated models like Meta-Llama3, you need to provide a HuggingFace hub token:

export token=<HUGGINGFACE_HUB_TOKEN>

You can then go ahead and launch a TorchServe instance serving your selected model:

docker run --rm -ti --shm-size 1g --gpus all -e HUGGING_FACE_HUB_TOKEN=$token -p 8080:8080 -v data:/data ts/vllm --model_id meta-llama/Meta-Llama-3-8B-Instruct --disable_token_auth

To change the model, you just need to exchange the identifier given to the --model_id parameter. You can test the model with:

curl -X POST -d '{"prompt":"Hello, my name is", "max_new_tokens": 50}' --header "Content-Type: application/json" "http://localhost:8080/predictions/model"

You can change any of the sampling arguments for the request by using the VLLM SamplingParams keywords. E.g., to set the sampling temperature to 0, we can do:

curl -X POST -d '{"prompt":"Hello, my name is", "max_new_tokens": 50, "temperature": 0}' --header "Content-Type: application/json" "http://localhost:8080/predictions/model"

TorchServe's LLM launcher script offers some customization options as well. To rename the model endpoint from predictions/model to something else you can add --model_name <SOME_NAME> to the docker run command.
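
For example, to expose the endpoint under predictions/llama3-8b instead (the name llama3-8b here is just an illustrative choice), the docker run command from above becomes:

docker run --rm -ti --shm-size 1g --gpus all -e HUGGING_FACE_HUB_TOKEN=$token -p 8080:8080 -v data:/data ts/vllm --model_id meta-llama/Meta-Llama-3-8B-Instruct --model_name llama3-8b --disable_token_auth

Requests are then sent to the renamed endpoint:

curl -X POST -d '{"prompt":"Hello, my name is", "max_new_tokens": 50}' --header "Content-Type: application/json" "http://localhost:8080/predictions/llama3-8b"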

The launcher script can also be used outside a docker container by running the following command after installing TorchServe according to the installation instructions.

python -m ts.llm_launcher --disable_token_auth

Please note that the launcher script, as well as the docker command, will automatically run on all available GPUs, so make sure to restrict the number of visible devices by setting CUDA_VISIBLE_DEVICES.
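
For example, to run the standalone launcher on the first GPU only (assuming the launcher accepts the same --model_id parameter as the docker entrypoint), you could do:

export CUDA_VISIBLE_DEVICES=0
python -m ts.llm_launcher --model_id mistralai/Mistral-7B-v0.1 --disable_token_auth

For the docker command, restrict the devices by adjusting the --gpus flag accordingly.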

For further customization of the handler and for adding third-party dependencies you can have a look at our VLLM example.

Supported models

The quickstart launcher should allow you to launch any model supported by VLLM. Here is a list of model identifiers tested by the TorchServe team:

  • meta-llama/Meta-Llama-3-8B
  • meta-llama/Meta-Llama-3-8B-Instruct
  • meta-llama/Llama-2-7b-hf
  • meta-llama/Llama-2-7b-chat-hf
  • mistralai/Mistral-7B-v0.1
  • mistralai/Mistral-7B-Instruct-v0.1

Other ways to deploy LLMs with TorchServe

TorchServe offers a variety of examples on how to deploy large models. Here is a list of the current examples: