An inference engine for the Llama2 model family, written in C++.
To build and run Glinthawk, you will need:

- CMake >= 3.18
- GCC >= 12 (C++20 support required)
- OpenSSL
- Protobuf
- Google Logging Library (glog)
For building the CUDA version, you will also need the CUDA Toolkit. Make sure your `nvcc` is compatible with your GCC version.
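On Ubuntu, these dependencies can typically be installed through apt. The package names below are the usual Ubuntu ones; verify them against your release:

```bash
# Typical Ubuntu package names for the dependencies above; verify
# against your distribution before relying on this list.
sudo apt-get update
sudo apt-get install -y cmake g++-12 libssl-dev \
    libprotobuf-dev protobuf-compiler libgoogle-glog-dev
```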
To build the project:

```bash
mkdir build
cd build
cmake ..
make -j`nproc`
```
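If your default GCC is too new for your CUDA toolkit, one way to pin the compilers is through standard CMake variables. This is a sketch; the versions below are examples, so match them to your setup:

```bash
# Point CMake at a GCC version that your nvcc supports.
cmake .. -DCMAKE_C_COMPILER=gcc-12 \
         -DCMAKE_CXX_COMPILER=g++-12 \
         -DCMAKE_CUDA_HOST_COMPILER=g++-12
make -j`nproc`
```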
Please adjust these steps to your setup as needed. Glinthawk has been tested only on Ubuntu 22.04 and later. The program requires C++20 support and makes limited use of some Linux-specific system calls, such as `memfd_create`.
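For context, here is a minimal, self-contained sketch of what `memfd_create` does; this is illustrative only and not Glinthawk's actual code:

```cpp
// Illustration of memfd_create (Linux-only, glibc >= 2.27): it creates
// an anonymous, memory-backed file descriptor. Not Glinthawk's code.
#include <sys/mman.h>  // memfd_create
#include <unistd.h>    // write, close
#include <cstdio>      // perror
#include <cstring>     // strlen

int main() {
  // "demo" is only a debugging name; it does not create a path on disk.
  int fd = memfd_create("demo", 0);
  if (fd < 0) { perror("memfd_create"); return 1; }

  const char msg[] = "hello";
  if (write(fd, msg, strlen(msg)) < 0) { perror("write"); close(fd); return 1; }

  close(fd);
  return 0;
}
```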
For more information on building this project, please take a look at the `Dockerfile.amd64` and `Dockerfile.cuda` files.
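If you prefer a containerized build, those Dockerfiles can be used directly. The image tags below are just examples:

```bash
# Build the CPU (amd64) and CUDA images; tag names are arbitrary.
docker build -f Dockerfile.amd64 -t glinthawk:amd64 .
docker build -f Dockerfile.cuda -t glinthawk:cuda .
```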
For testing purposes, you can use tinyllamas. Please use the `tools/bin2glint.py` script to convert `.bin` files to Glinthawk's format. There are other scripts in the `tools` directory for converting the original Llama2 models.
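For example, a possible workflow, assuming the checkpoints from the `karpathy/tinyllamas` repository on Hugging Face; the `bin2glint.py` arguments shown are hypothetical, so check the script for its actual interface:

```bash
# Download one of the published tinyllamas checkpoints.
wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories110M.bin

# Convert it to Glinthawk's format. The argument here is a guess;
# read tools/bin2glint.py for its real usage.
python3 tools/bin2glint.py stories110M.bin
```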
## Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.