Docker images are based on Nvidia CUDA images. LLMs are pre-loaded and served via vLLM.
- `TENSOR_PARALLEL_SIZE`: Number of GPUs to use. Default: `1`.
The OpenAI-compatible API is exposed on port `8000`.
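As a sketch, the container can be started like any GPU-enabled Docker image; the tag below is one of those listed in this README, and the `TENSOR_PARALLEL_SIZE` value of `2` is only an example for a two-GPU host:

```shell
# Run the image with GPU access, splitting the model across 2 GPUs
# and publishing the OpenAI-compatible API on port 8000.
docker run --gpus all \
  -e TENSOR_PARALLEL_SIZE=2 \
  -p 8000:8000 \
  ivangabriele/llm:lmsys__vicuna-13b-v1.5-16k
```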
Note: The VRAM column is the minimum amount of VRAM the model requires on a single GPU.
| Tag | Model | RunPod | Vast.ai | VRAM |
|---|---|---|---|---|
| `ivangabriele/llm:lmsys__vicuna-13b-v1.5-16k` | | | | 26GB |
| `ivangabriele/llm:open-orca__llongorca-13b-16k` | | | | 26GB |
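Once a container from the table above is running, the API can be queried with any OpenAI-compatible client. A minimal sketch with `curl`, assuming the served model name matches the Hugging Face ID implied by the image tag:

```shell
# Query the OpenAI-compatible completions endpoint exposed on port 8000.
# The "model" value is an assumption derived from the image tag.
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "lmsys/vicuna-13b-v1.5-16k", "prompt": "Hello,", "max_tokens": 16}'
```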
- Add more popular models.
- Start the server in the background to allow for SSH access.