2.2.11 Backend: AirLLM
Handle: airllm
URL: http://localhost:33981
AirLLM optimizes inference memory usage, allowing 70B large language models to run inference on a single 4GB GPU card without quantization, distillation, or pruning. You can even run Llama 3.1 405B on 8GB of VRAM now.
Note that the above is true, but don't expect performant inference. AirLLM loads LLM layers into memory in small groups, so only a small slice of the model has to be resident on the GPU at any moment (a 70B model in fp16 is roughly 140GB of weights, but a single layer of an ~80-layer model is under 2GB). The main benefit is that it allows a "transformers"-like workflow for models that are much, much larger than your VRAM.
Note
By default, AirLLM requires a GPU with CUDA and can't be run on the CPU.
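Before starting the service, it helps to confirm that the GPU is actually reachable from Docker. A minimal sanity check, assuming the NVIDIA Container Toolkit is installed (the CUDA image tag below is an assumption, any recent base tag will do):
# Confirm the host driver sees the GPU
nvidia-smi
# Confirm Docker can pass the GPU through to containers
docker run --rm --gpus all nvidia/cuda:12.3.2-base-ubuntu22.04 nvidia-smi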
# [Optional] Pre-build the image
# Needs PyTorch and CUDA, so will be quite large
harbor build airllm
# Start the service
# Will download selected models if not present yet
harbor up airllm
# See service logs
harbor logs airllm
# Check it's running
curl $(harbor url airllm)/v1/models
AirLLM only supports specific models; see the original README for details.
Note
The default context size is 128, in line with the official examples.
For funsies, Harbor also adds an OpenAI-compatible API on top of AirLLM, so you can enjoy a 40-minute wait per request when "chatting" with Llama 3.1 405B.
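A minimal sketch of calling that API with curl. The request body follows the standard OpenAI chat completions schema; the endpoint path and the model name here are assumptions, so match the model to whatever you configure below.
# Send a chat request to the OpenAI-compatible endpoint
# (model name is an example - use the one you configured)
curl "$(harbor url airllm)/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 32
  }'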
# Set the model to run
harbor airllm model meta-llama/Meta-Llama-3.1-8B-Instruct
# Set the context size to use
harbor airllm ctx 1024
# Set the compression
harbor airllm compression 4bit
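A sketch of applying the configuration above, assuming (as with other Harbor services) that changes take effect on the next service start:
# Restart the service so the new settings apply
harbor down
harbor up airllm
# Verify the configured model is being served
curl $(harbor url airllm)/v1/models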