
llama-cpp-cffi


Python binding for llama.cpp using cffi. Supports CPU, Vulkan 1.x, and CUDA 12.6 runtimes on x86_64 and aarch64 platforms.

NOTE: Linux (manylinux_2_28 and musllinux_1_2) is currently the only supported operating system, but we are working on Windows and macOS versions.

News

  • Dec 9 2024, v0.2.0: Support for the low-level and high-level llama, llava, clip, and ggml APIs.
  • Nov 27 2024, v0.1.22: Support for multimodal models such as llava and minicpmv.

Install

Basic library install:

pip install llama-cpp-cffi

IMPORTANT: To take advantage of NVIDIA GPU acceleration, make sure you have CUDA 12.x installed. If you don't, follow the instructions here: https://developer.nvidia.com/cuda-downloads .

Supported GPU compute capabilities: compute_61, compute_70, compute_75, compute_80, compute_86, and compute_89, covering most GPUs from the GeForce GTX 1050 to the NVIDIA H100.
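
If you are unsure whether a CUDA-capable setup is visible at all, a quick driver check can save debugging time. The snippet below is a minimal sketch using plain Python and the standard nvidia-smi tool; it is not part of llama-cpp-cffi:

import shutil
import subprocess

# check whether the NVIDIA driver is visible before expecting GPU acceleration
if shutil.which('nvidia-smi') is None:
    print('nvidia-smi not found; expect a CPU or Vulkan runtime to be used')
else:
    # lists the detected GPUs, e.g. "GPU 0: NVIDIA GeForce RTX 3090 (UUID: ...)"
    print(subprocess.run(['nvidia-smi', '-L'], capture_output=True, text=True).stdout)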

LLM Example

from llama import Model


#
# first define the model, then load/init it
#
model = Model(
    creator_hf_repo='HuggingFaceTB/SmolLM2-1.7B-Instruct',  # original (creator) model repo
    hf_repo='bartowski/SmolLM2-1.7B-Instruct-GGUF',         # repo hosting the GGUF conversion
    hf_file='SmolLM2-1.7B-Instruct-Q4_K_M.gguf',            # quantized GGUF file to use
)

# ctx_size: context window in tokens; predict: max tokens to generate;
# gpu_layers: number of layers to offload to the GPU (99 effectively offloads all)
model.init(ctx_size=8192, predict=1024, gpu_layers=99)

#
# chat-style completion from a list of messages
#
messages = [
    {'role': 'system', 'content': 'You are a helpful assistant.'},
    {'role': 'user', 'content': '1 + 1 = ?'},
    {'role': 'assistant', 'content': '2'},
    {'role': 'user', 'content': 'Evaluate 1 + 2 in Python.'},
]

for chunk in model.completions(messages=messages, temp=0.7, top_p=0.8, top_k=100):
    print(chunk, flush=True, end='')

#
# completion from a plain text prompt
#
for chunk in model.completions(prompt='Evaluate 1 + 2 in Python. Result in Python is', temp=0.7, top_p=0.8, top_k=100):
    print(chunk, flush=True, end='')
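
completions yields the response as a stream of text chunks (strings, as the print loops above suggest), so a full response can be assembled by joining the stream. A minimal sketch reusing the model and messages defined above:

# collect the streamed chunks into a single response string
answer = ''.join(model.completions(messages=messages, temp=0.7, top_p=0.8, top_k=100))
print(answer)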

VLM Example

from llama import Model


#
# first define the model, then load/init it
#
model = Model(  # 1.87B-parameter model
    creator_hf_repo='vikhyatk/moondream2',
    hf_repo='vikhyatk/moondream2',
    hf_file='moondream2-text-model-f16.gguf',     # text model weights
    mmproj_hf_file='moondream2-mmproj-f16.gguf',  # multimodal projector (vision) weights
)

model.init(ctx_size=8192, predict=1024, gpu_layers=99)  # same settings as the LLM example

#
# completion from a prompt plus an image
#
for chunk in model.completions(prompt='Describe this image.', image='examples/llama-1.png'):
    print(chunk, flush=True, end='')
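
The loaded model can be reused across images by calling completions once per file. A short sketch; the second image path is hypothetical, for illustration only:

# describe several images with one loaded model
for path in ('examples/llama-1.png', 'examples/llama-2.png'):  # second path is hypothetical
    print(f'{path}:')

    for chunk in model.completions(prompt='Describe this image.', image=path):
        print(chunk, flush=True, end='')

    print()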

References

  • examples/llm.py
  • examples/vlm.py
