This implementation supports Llama 3.1 70B with vLLM at https://github.com/tenstorrent/vllm/tree/dev
If the first-run setup has already been completed, start here. Otherwise, see the First run setup instructions below.
Run the container from the project root at tt-inference-server:
cd tt-inference-server
# make sure if you already set up the model weights and cache you use the correct persistent volume
export MODEL_VOLUME=$PWD/persistent_volume/volume_id_tt-metal-Llama-3.3-70B-Instructv0.0.1/
docker run \
--rm \
-it \
--env-file persistent_volume/model_envs/Llama-3.3-70B-Instruct.env \
--cap-add ALL \
--device /dev/tenstorrent:/dev/tenstorrent \
--volume /dev/hugepages-1G:/dev/hugepages-1G:rw \
--volume ${MODEL_VOLUME?ERROR env var MODEL_VOLUME must be set}:/home/user/cache_root:rw \
--shm-size 32G \
--publish 7000:7000 \
ghcr.io/tenstorrent/tt-inference-server/tt-metal-llama3-70b-src-base-vllm:v0.0.1-tt-metal-v0.54.0-rc2-953161188c50
By default the Docker container will start running the entrypoint command wrapped in src/run_vllm_api_server.py.
This can be run manually if you override the container's default command with an interactive shell via bash.
In an interactive shell you can start the vLLM API server via:
# run server manually
python run_vllm_api_server.py
The vLLM inference API server takes 3-5 minutes to start up (~40-60 minutes on the first run while generating caches), then begins serving requests. To send HTTP requests to the inference server, run the example scripts in a separate bash shell.
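Before sending requests you can check from the host that the server is ready; this is a minimal sketch, assuming the server exposes vLLM's standard /health endpoint on the published port 7000:
# poll the server until it reports healthy (assumes vLLM's standard /health endpoint)
until curl -sf http://localhost:7000/health > /dev/null; do sleep 10; done && echo "vLLM server is up"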
You can use docker exec -it <container-id> bash to create a shell in the Docker container, or run the client scripts on the host (ensuring the correct port mappings and Python dependencies):
# oneliner to enter interactive shell on most recently ran container
docker exec -it $(docker ps -q | head -n1) bash
# inside interactive shell, run example clients script making requests to vLLM server:
cd ~/app/src
# this example runs a single request from alpaca eval, expecting and parsing the streaming response
python example_requests_client_alpaca_eval.py --stream True --n_samples 1 --num_full_iterations 1 --batch_size 1
# this example runs a full-dataset stress test with 32 simultaneous users making requests
python example_requests_client_alpaca_eval.py --stream True --n_samples 805 --num_full_iterations 1 --batch_size 32
The tested starting condition is a fresh installation of Ubuntu 20.04 with the Tenstorrent system dependencies installed.
Install Docker following the Ubuntu apt repository guide: https://docs.docker.com/engine/install/ubuntu/#install-using-the-repository
It is recommended to follow the post-install guide to allow $USER to run docker without sudo: https://docs.docker.com/engine/install/linux-postinstall/
- tt-smi: https://github.com/tenstorrent/tt-smi
- firmware: bundle 80.10.1.0 (https://github.com/tenstorrent/tt-firmware/blob/02b4b6ed49b6ea2fb9a8664e99d4fed25e443bd6/experiments/fw_pack-80.10.1.0.fwbundle)
- drivers: tt-kmd version 1.29 (https://github.com/tenstorrent/tt-kmd/tree/ttkmd-1.29)
- topology: ensure mesh topology https://github.com/tenstorrent/tt-topology
- hugepages: https://github.com/tenstorrent/tt-system-tools
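After installing the dependencies above, a quick sanity check can confirm the host is ready. This is a sketch under the assumptions that the tt-kmd driver registers a kernel module named tenstorrent and that hugepages show up in /proc/meminfo:
# confirm the Tenstorrent kernel module is loaded (assumes module name "tenstorrent")
lsmod | grep tenstorrent
# confirm hugepages are allocated (the container mounts /dev/hugepages-1G)
grep -i huge /proc/meminfo
# confirm the device(s) are detected
tt-smi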
For peak performance it is recommended to set the CPU frequency governor to performance mode. This step is optional and can be skipped if it is not possible on your setup, though performance may be lower than otherwise expected.
sudo apt-get update && sudo apt-get install -y linux-tools-generic
# enable perf mode
sudo cpupower frequency-set -g performance
# disable perf mode (if desired later)
# sudo cpupower frequency-set -g ondemand
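To confirm the change took effect, you can inspect the current policy with the same cpupower tool installed above:
# show the current cpufreq policy; the governor should now be "performance"
cpupower frequency-info --policy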
Either download the Docker image from GitHub Container Registry (recommended for first run) or build the Docker image locally using the dockerfile.
# pull image from GHCR
docker pull ghcr.io/tenstorrent/tt-inference-server/tt-metal-llama3-70b-src-base-vllm:v0.0.1-tt-metal-v0.54.0-rc2-953161188c50
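You can confirm the image is available locally before starting the container:
# list local copies of the inference server image
docker images ghcr.io/tenstorrent/tt-inference-server/tt-metal-llama3-70b-src-base-vllm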
For instructions on building the Docker image locally see: vllm-tt-metal-llama3/docs/development
The setup.sh script automates:
- interactively creating the model specific .env file,
- downloading the model weights,
- (if required) repacking the weights for tt-metal implementation,
- creating the default persistent storage directory structure and permissions.
cd tt-inference-server
chmod +x setup.sh
./setup.sh llama-3.1-70b-instruct
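Once setup.sh completes, you can verify that it created the expected artifacts; this is a sketch based on the paths used earlier in this guide (the exact env file and volume names depend on the model you selected):
# the model-specific .env file created by setup.sh
ls persistent_volume/model_envs/
# the persistent volume that will hold weights and caches
ls persistent_volume/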