llama-swap

On-demand model switching with llama.cpp (or other OpenAI compatible backends)

Introduction

llama-swap is an OpenAI API compatible server that gives you complete control over how you use your hardware. When a request comes in, it automatically swaps to the server configuration you defined for the requested model. Since llama.cpp's server can't swap models, let's swap the server instead!

Features:

  • ✅ Easy to deploy: single binary with no dependencies
  • ✅ Easy to configure: a single YAML file
  • ✅ On-demand model switching
  • ✅ Full control over server settings per model
  • ✅ OpenAI API support (v1/completions, v1/chat/completions, v1/embeddings and v1/rerank)
  • ✅ Multiple GPU support
  • ✅ Run multiple models at once with profiles
  • ✅ Remote log monitoring at /log
  • ✅ Automatic unloading of models from GPUs after timeout
  • ✅ Use any local OpenAI compatible server (llama.cpp, vllm, tabbyAPI, etc.)
  • ✅ Direct access to upstream HTTP server via /upstream/:model_id (demo)

Releases

Builds for Linux and macOS are available on the Releases page.

Building from source

  1. Install golang for your system
  2. git clone git@github.com:mostlygeek/llama-swap.git
  3. make clean all
  4. Binaries will be in the build/ subdirectory
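
Put together, the full sequence looks something like this (assuming Go, git, and make are already on your PATH):

git clone git@github.com:mostlygeek/llama-swap.git
cd llama-swap
make clean all
ls build/   # the compiled binaries are placed here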

config.yaml

llama-swap's configuration is purposefully simple.

# Seconds to wait for llama.cpp to load and be ready to serve requests
# Default (and minimum) is 15 seconds
healthCheckTimeout: 60

# Write HTTP logs (useful for troubleshooting), defaults to false
logRequests: true

# define valid model names and how to start the upstream server for each
models:
  "llama":
    cmd: llama-server --port 8999 -m Llama-3.2-1B-Instruct-Q4_K_M.gguf

    # where to reach the server started by cmd, make sure the ports match
    proxy: http://127.0.0.1:8999

    # alternate model names that also route to this model
    aliases:
    - "gpt-4o-mini"
    - "gpt-3.5-turbo"

    # check this path for an HTTP 200 OK before serving requests
    # default: /health to match llama.cpp
    # use "none" to skip endpoint checking, but may cause HTTP errors
    # until the model is ready
    checkEndpoint: /custom-endpoint

    # automatically unload the model after this many seconds
    # ttl must be a value greater than 0
    # default: 0 = never unload model
    ttl: 60

  "qwen":
    # environment variables to pass to the command
    env:
      - "CUDA_VISIBLE_DEVICES=0"

    # multiline for readability
    cmd: >
      llama-server --port 8999
      --model path/to/Qwen2.5-1.5B-Instruct-Q4_K_M.gguf
    proxy: http://127.0.0.1:8999

  # unlisted models do not show up in /v1/models or /upstream lists
  # but they can still be requested as normal
  "qwen-unlisted":
    cmd: llama-server --port 9999 -m Llama-3.2-1B-Instruct-Q4_K_M.gguf -ngl 0
    unlisted: true

# profiles make it easy to manage multi-model (and multi-GPU) configurations.
#
# Tips:
#  - each model must be listening on a unique address and port
#  - the model name is in this format: "profile_name:model", like "coding:qwen"
#  - the profile will load and unload all models in the profile at the same time
profiles:
  coding:
    - "qwen"
    - "llama"

Advanced examples

  • config.example.yaml includes examples for supporting the v1/embeddings and v1/rerank endpoints
  • Speculative Decoding - using a small draft model can increase inference speeds by 20% to 40%. This example includes configurations for Qwen2.5-Coder-32B (2.5x increase) and Llama-3.1-70B (1.4x increase) in the best cases; a configuration sketch follows this list.
  • Optimizing Code Generation - find the optimal settings for your machine. This example demonstrates defining multiple configurations and testing which one is fastest.
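
For illustration, an entry under models: for speculative decoding might look roughly like the sketch below. The file names, port, and -ngl value are placeholders, and the draft-model flags differ between llama.cpp versions, so check llama-server --help and the linked example for the authoritative options.

  "qwen-coder-32b":
    # placeholder paths; --model-draft (-md) points llama-server at a small draft model
    cmd: >
      llama-server --port 9503
      --model path/to/Qwen2.5-Coder-32B-Instruct-Q4_K_M.gguf
      --model-draft path/to/Qwen2.5-Coder-0.5B-Q8_0.gguf
      -ngl 99
    proxy: http://127.0.0.1:9503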

Installation

  1. Create a configuration file, see config.example.yaml
  2. Download a release appropriate for your OS and architecture.
    • Note: Windows is currently untested.
  3. Run the binary with llama-swap --config path/to/config.yaml
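
Once it's running, a quick sanity check is to list the models llama-swap knows about (replace localhost:8080 with whatever address llama-swap is listening on):

# returns the models defined in config.yaml (unlisted models are hidden)
curl http://localhost:8080/v1/models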

Monitoring Logs

Open http://<host>/logs in your browser to get a web interface with streaming logs.

Of course, CLI access is also supported:

# sends up to the last 10KB of logs
curl 'http://host/logs'

# streams logs
curl -Ns 'http://host/logs/stream'

# stream and filter logs with Linux pipes
curl -Ns http://host/logs/stream | grep 'eval time'

# skips history and just streams new log entries
curl -Ns 'http://host/logs/stream?no-history'

Systemd Unit Files

Use this unit file to start llama-swap on boot. It has only been tested on Ubuntu.

/etc/systemd/system/llama-swap.service

[Unit]
Description=llama-swap
After=network.target

[Service]
User=nobody

# set this to match your environment
ExecStart=/path/to/llama-swap --config /path/to/llama-swap.config.yml

Restart=on-failure
RestartSec=3
StartLimitBurst=3
StartLimitInterval=30

[Install]
WantedBy=multi-user.target
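
After creating the unit file, reload systemd and enable the service so it starts now and on every boot:

sudo systemctl daemon-reload
sudo systemctl enable --now llama-swap.service

# follow the service's output
journalctl -u llama-swap.service -f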