
LeapfrogAI LLaMA C++ Python Backend

A LeapfrogAI API-compatible llama-cpp-python wrapper for quantized and un-quantized model inference on CPU infrastructure.

Usage

Pre-Requisites

See the LeapfrogAI documentation website for system requirements and dependencies.

Dependent Components

Model Selection

The default model that ships with this backend in the repository's officially released images is a quantized version of the Synthia-7b model.

Models are pulled from Hugging Face Hub via the model_download.py script. To change which model ships with the llama-cpp-python backend, set the following environment variables:

REPO_ID   # e.g. "TheBloke/SynthIA-7B-v2.0-GGUF"
FILENAME  # e.g. "synthia-7b-v2.0.Q4_K_M.gguf"
REVISION  # e.g. "3f65d882253d1f15a113dabf473a7c02a004d2b5"
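
For example, the variables can be exported in the shell before invoking the download script. This is a minimal sketch using the example values above; it assumes model_download.py reads these values from the environment, as described in the "Local Development" section below:

# Example values from this README; substitute your own model repository, file, and revision
export REPO_ID="TheBloke/SynthIA-7B-v2.0-GGUF"
export FILENAME="synthia-7b-v2.0.Q4_K_M.gguf"
export REVISION="3f65d882253d1f15a113dabf473a7c02a004d2b5"

# Download the model file into the .model/ directory (see "Local Development" below)
python scripts/model_download.py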

If you choose a different model, be sure to update the default config.yaml to reflect that model's files and model card from its Hugging Face model repository.

Deployment

To build and deploy the llama-cpp-python backend Zarf package into an existing UDS Kubernetes cluster:

Important

Execute the following commands from the root of the LeapfrogAI repository

pip install 'huggingface_hub[cli,hf_transfer]'  # used to download the model weights from Hugging Face Hub
make build-llama-cpp-python LOCAL_VERSION=dev
uds zarf package deploy packages/llama-cpp-python/zarf-package-llama-cpp-python-*-dev.tar.zst --confirm
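
After the deployment completes, a quick sanity check is to confirm the package was deployed and that the backend pod starts. This is a hedged sketch; the leapfrogai namespace below is an assumption, so substitute whatever namespace your UDS cluster uses for LeapfrogAI workloads:

# List deployed Zarf packages and confirm llama-cpp-python is present
uds zarf package list

# Watch for the backend pod to reach the Ready state (namespace is an assumption)
kubectl get pods -n leapfrogai -w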

Local Development

To run the llama-cpp-python backend locally:

Important

Execute the following commands from this sub-directory

# Install dev and runtime dependencies
make install

# Clone Model
# Supply a REPO_ID, FILENAME, and REVISION, as seen in the "Model Selection" section
python scripts/model_download.py
mv .model/*.gguf .model/model.gguf

# Start the model backend
make dev
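
Once make dev is running, the backend serves a gRPC API. A quick smoke test with grpcurl is sketched below; the port (50051) and the availability of gRPC server reflection are assumptions, so adjust them to match your local configuration:

# List the gRPC services exposed by the locally running backend (assumes reflection is enabled)
grpcurl -plaintext localhost:50051 list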