nanochatllms.cpp is a repository containing pure C++ implementations of Chat-LLMs with fewer than 3 billion parameters. The goal is to provide implementations of quantised small Chat-LLMs that can run efficiently on lower-end devices. The models are implemented in FP16, 8-bit and 4-bit formats. This project was inspired by llama.cpp and llama.c.
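To give a rough idea of what the 8-bit format involves, the sketch below shows symmetric block-wise quantisation in plain C++. The block size, struct layout and function names are illustrative assumptions for this sketch, not the exact format used by nanochatllms.cpp.

```cpp
// Sketch of symmetric block-wise 8-bit quantisation (the general idea behind
// Q8-style formats). Block size and struct layout here are assumptions.
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int kBlockSize = 32;      // weights are quantised in blocks of 32

struct Q8Block {
    float scale;                    // per-block dequantisation scale
    int8_t q[kBlockSize];           // quantised weights
};

std::vector<Q8Block> quantize_q8(const float* w, std::size_t n) {
    std::vector<Q8Block> out(n / kBlockSize);
    for (std::size_t b = 0; b < out.size(); ++b) {
        const float* src = w + b * kBlockSize;
        // Find the largest magnitude in the block and map it to +/-127.
        float amax = 0.0f;
        for (int i = 0; i < kBlockSize; ++i)
            amax = std::max(amax, std::fabs(src[i]));
        const float scale = amax / 127.0f;
        out[b].scale = scale;
        for (int i = 0; i < kBlockSize; ++i)
            out[b].q[i] = static_cast<int8_t>(
                std::round(scale > 0.0f ? src[i] / scale : 0.0f));
    }
    return out;
}

// Dequantise one block back to float, e.g. for use inside a matmul kernel.
void dequantize_q8(const Q8Block& blk, float* dst) {
    for (int i = 0; i < kBlockSize; ++i)
        dst[i] = blk.q[i] * blk.scale;
}
```

A 4-bit format follows the same idea but packs two weights per byte, which is where the roughly 2x and 4x file-size reductions relative to FP16 in the table below come from.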
- MiniCPM-2B DPO source License
- StableLM-2-Zephyr-1.6B source
- TinyLlama-1.1B-Chat-v0.4 source
- OpenELM source
git clone https://github.com/iangitonga/nanochatllms.cpp.git
cd nanochatllms.cpp/
cmake -S . -B build/ -DCMAKE_BUILD_TYPE=Release
cmake --build build/
build/bin/nanochat -m tinyllama -p "Give three tips on staying healthy."
To see all the available options, run
build/bin/nanochat --help
Note: Performance was recorded on an Intel Xeon CPU @ 2.20GHz with two cores and AVX enabled.
| Model | Format | Model size (GB) | Performance (tokens/sec) |
|---|---|---|---|
| MiniCPM-2B | FP16 | 5.5 | 2.4 |
| | Q8 | 2.9 | 3.2 |
| | Q4 | 1.5 | 3.4 |
| Zephyr-1.6B | FP16 | 3.3 | 4.2 |
| | Q8 | 1.8 | 5.2 |
| | Q4 | 0.9 | 5.6 |
| TinyLlama-1.1B | FP16 | 2.2 | 6.3 |
| | Q8 | 1.2 | 8.7 |
| | Q4 | 0.6 | 9.1 |
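As a rough sanity check on the sizes above, take MiniCPM-2B to have about 2.7 billion parameters (an estimate inferred from the FP16 file size, not a figure stated in the table): FP16 stores 2 bytes per weight, giving roughly 2.7B × 2 B ≈ 5.5 GB; Q8 stores about 1 byte per weight plus per-block scales, ≈ 2.9 GB; Q4 packs two weights per byte plus scales, ≈ 1.5 GB. The same ratios hold for the other two models.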