A chat interface based on llama.cpp for running Alpaca models. Entirely self-hosted, with no API keys needed. Fits in 4GB of RAM and runs on the CPU.
- SvelteKit frontend
- Redis for storing chat history & parameters
- FastAPI + LangChain for the API, wrapping calls to llama.cpp using its Python bindings
Deploy, then open the URI shown in Cloudmos.
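If you'd rather run it locally with Docker, here is a minimal sketch; the image name, volume name, and container paths are assumptions, so check the project's install instructions for the exact values:

```bash
# Run the Serge container in the background, persisting model weights
# in a named volume and exposing the web UI / API on port 8008.
# Image name, volume name, and mount path are assumptions.
docker run -d \
  --name serge \
  -v serge_weights:/usr/src/app/weights \
  -p 8008:8008 \
  ghcr.io/nsarrazin/serge:latest
```

Once the container is up, the web UI should be reachable at http://localhost:8008.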
The API documentation can be found at http://localhost:8008/api/docs.
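FastAPI also exposes a machine-readable OpenAPI schema alongside the interactive docs; the exact path below is an assumption based on FastAPI defaults:

```bash
# Fetch the OpenAPI schema; /api/openapi.json is an assumed path
# (FastAPI's default, prefixed to match the /api/docs URL).
curl http://localhost:8008/api/openapi.json
```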
Currently the following models are supported:
- GPT4-Alpaca-LoRA-30B
- Alpaca-LoRA-65B
- OpenAssistant-30B
- GPT4All-13B
- Stable-Vicuna-13B
- Guanaco-7B
- Guanaco-13B
- Guanaco-33B
- Guanaco-65B
If you have existing weights from another project, you can add them to the `serge_weights` volume using `docker cp`, as shown below.
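For example, assuming the container is named `serge` and the `serge_weights` volume is mounted at `/usr/src/app/weights` (both are assumptions; check `docker ps` and your compose file):

```bash
# Copy a local GGML weights file into the running container's weights
# directory. The file name here is just a placeholder.
docker cp ./ggml-alpaca-7b-q4.bin serge:/usr/src/app/weights/
```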
llama.cpp will simply crash if there isn't enough free memory for your model (see the memory check after this list):
- 7B requires about 4.5GB of free RAM
- 13B requires about 12GB free
- 30B requires about 20GB free
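On Linux, you can check how much memory is actually free before picking a model size:

```bash
# Show total, used, and available memory in human-readable units.
free -h
```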
Feel free to join the Discord if you need help with the setup: https://discord.gg/62Hc6FEYQH
Serge is always open for contributions! If you catch a bug or have a feature idea, feel free to open an issue or a PR.
- Front-end to interface with the API
- Pass model parameters when creating a chat
- Manager for model files
- Support for other models
- LangChain integration
- User profiles & authentication
And a lot more!