This document describes how to easily serve large language models (LLMs) such as Meta-Llama3 with TorchServe. In addition to a quick start guide based on our vLLM integration, we also provide a list of examples describing other methods of deploying LLMs with TorchServe.
TorchServe offers easy LLM deployment through its vLLM integration. Using our LLM launcher script, users can deploy any model supported by vLLM with a single command. The launcher can be used either standalone or in combination with our provided TorchServe GPU Docker image.
To use the Docker image we first need to build it:
docker build . -f docker/Dockerfile.vllm -t ts/vllm
Models are usually loaded from the HuggingFace Hub and cached in a Docker volume for faster reloads. If you want to access gated models like Meta-Llama3, you need to provide a HuggingFace Hub token:
export token=<HUGGINGFACE_HUB_TOKEN>
You can then go ahead and launch a TorchServe instance serving your selected model:
docker run --rm -ti --shm-size 1g --gpus all -e HUGGING_FACE_HUB_TOKEN=$token -p 8080:8080 -v data:/data ts/vllm --model_id meta-llama/Meta-Llama-3-8B-Instruct --disable_token_auth
To serve a different model, simply swap out the identifier given to the --model_id parameter.
You can test the model with:
curl -X POST -d '{"prompt":"Hello, my name is", "max_new_tokens": 50}' --header "Content-Type: application/json" "http://localhost:8080/predictions/model"
You can change any of the sampling arguments for the request by using the vLLM SamplingParams keywords. E.g., to set the sampling temperature to 0:
curl -X POST -d '{"prompt":"Hello, my name is", "max_new_tokens": 50, "temperature": 0}' --header "Content-Type: application/json" "http://localhost:8080/predictions/model"
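Other SamplingParams keywords can be set the same way; the following request combines a temperature with top_p (the values here are purely illustrative):

curl -X POST -d '{"prompt":"Hello, my name is", "max_new_tokens": 50, "temperature": 0.8, "top_p": 0.95}' --header "Content-Type: application/json" "http://localhost:8080/predictions/model"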
TorchServe's LLM launcher script offers some customization options as well. To rename the model endpoint from predictions/model to something else, you can add --model_name <SOME_NAME> to the docker run command.
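For example, assuming the image and token from the quick start above, the following illustrative command serves the model under predictions/llama3 instead of predictions/model (the name llama3 is just an example):

docker run --rm -ti --shm-size 1g --gpus all -e HUGGING_FACE_HUB_TOKEN=$token -p 8080:8080 -v data:/data ts/vllm --model_id meta-llama/Meta-Llama-3-8B-Instruct --model_name llama3 --disable_token_auth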
The launcher script can also be used outside a Docker container by running the following command after installing TorchServe according to the installation instructions:
python -m ts.llm_launcher --disable_token_auth
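Since the Docker image simply wraps this launcher, the standalone invocation should accept the same options shown above; for example, to serve a specific model you can pass --model_id here as well (the model chosen below is just an example):

python -m ts.llm_launcher --model_id mistralai/Mistral-7B-Instruct-v0.1 --disable_token_auth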
Please note that the launcher script, as well as the Docker command, will automatically run on all available GPUs, so make sure to restrict the number of visible devices by setting CUDA_VISIBLE_DEVICES.
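For example, to restrict the standalone launcher to the first GPU only:

export CUDA_VISIBLE_DEVICES=0
python -m ts.llm_launcher --disable_token_auth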
For further customization of the handler and for adding third-party dependencies, have a look at our vLLM example.
The quickstart launcher should allow you to launch any model supported by vLLM. Here is a list of model identifiers tested by the TorchServe team:
- meta-llama/Meta-Llama-3-8B
- meta-llama/Meta-Llama-3-8B-Instruct
- meta-llama/Llama-2-7b-hf
- meta-llama/Llama-2-7b-chat-hf
- mistralai/Mistral-7B-v0.1
- mistralai/Mistral-7B-Instruct-v0.1
TorchServe offers a variety of examples on how to deploy large models. Here is a list of the current examples: