This repository has been archived due to a lack of time and resources for continued development. If you are interested in continuing the development of this project, or obtaining the crate name, please contact @philpax.
There are several high-quality alternatives for inference of LLMs and other models in Rust. We recommend that you consider using one of these libraries instead of llm; they have been kept up-to-date and are more likely to be actively maintained.
A selection is presented below. Note that this is not an exhaustive list, and the best solution for you may have changed since this list was compiled:
- Ratchet: a wgpu-based ML inference library with a focus on web support and efficient inference
- Candle-based libraries (i.e. pure Rust outside of platform support libraries):
  - mistral.rs: supports quantized models for popular LLM architectures, Apple Silicon + CPU + CUDA support, and is designed to be easy to use
  - kalosm: simple interface for language, audio and image models
  - candle-transformers: first-party Candle library for inference of a wide variety of transformer-based models, similar to Hugging Face Transformers. Relatively low-level, so some knowledge of ML will be required.
  - callm: supports Llama, Mistral, Phi 3 and Qwen 2
- llama.cpp wrappers (i.e. not pure Rust, but at the frontier of open-source compiled LLM inference):
  - drama_llama: high-level Rust-idiomatic wrapper around llama.cpp
  - llm_client: also supports other external LLM APIs
  - llama_cpp: safe, high-level Rust bindings
  - llama-cpp-2: lightly-wrapped raw bindings that follow the C++ API closely
- Aggregators of external LLM APIs:
The original README follows.
llm is an ecosystem of Rust libraries for working with large language models.
It is built on top of the fast, efficient GGML library for machine learning.
Image by @darthdeus, using Stable Diffusion
This library is no longer actively maintained. For reference, the following is the state of the project as of the last update.
There are currently four available versions of llm (the crate and the CLI):
- The released version 0.1.1 on crates.io. This version is very out of date and does not include support for the most recent models.
- The main branch of this repository. This version can reliably infer GGMLv3 models, but does not support GGUF, and uses an old version of GGML.
- The gguf branch of this repository; this is a version of main that supports inferencing with GGUF, but does not support any models other than Llama, requires the use of a Hugging Face tokenizer, and does not support quantization. It also uses an old version of GGML.
- The develop branch of this repository. This is a from-scratch re-port of llama.cpp to synchronize with the latest version of GGML, and to support all models and GGUF. This will not be completed due to the archival of the project.
The primary entrypoint for developers is the llm crate, which wraps llm-base
and the supported model crates. Documentation for the released version is
available on Docs.rs.
For end-users, there is a CLI application, llm-cli, which provides a
convenient interface for interacting with supported models. Text generation
can be done as a one-off based on a prompt, or interactively, through REPL or
chat modes. The CLI can also be used to serialize (print) decoded models,
quantize GGML files, or compute the perplexity of a model. It can be
downloaded from the latest GitHub release or installed from crates.io.
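For readers unfamiliar with the metric, perplexity is the exponential of the average negative log-likelihood per token. The following is a minimal, self-contained sketch of that formula only; the function name `perplexity` and its input (the natural-log probability the model assigned to each observed token) are assumptions for illustration and do not reflect how llm-cli implements it.

```rust
/// Perplexity of a token sequence, given the natural-log probability the
/// model assigned to each observed token. Returns `None` for empty input.
/// (Illustrative helper; not part of the llm crate.)
fn perplexity(token_log_probs: &[f64]) -> Option<f64> {
    if token_log_probs.is_empty() {
        return None;
    }
    // Mean negative log-likelihood per token.
    let mean_nll = -token_log_probs.iter().sum::<f64>() / token_log_probs.len() as f64;
    Some(mean_nll.exp())
}

fn main() {
    // A model that assigns probability 0.25 to every token is as uncertain
    // as a uniform choice among four options, so perplexity is close to 4.
    let log_probs = vec![0.25f64.ln(); 8];
    println!("{:?}", perplexity(&log_probs));
}
```

Lower values mean the model was less "surprised" by the text it was evaluated on.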
llm is powered by the ggml tensor library, and aims to bring the
robustness and ease of use of Rust to the world
of large language models. At present, inference is only on the CPU, but we hope
to support GPU inference in the future through alternate backends.
Currently, the following models are supported:
- BLOOM
- GPT-2
- GPT-J
- GPT-NeoX (includes StableLM, RedPajama, and Dolly 2.0)
- LLaMA (includes Alpaca, Vicuna, Koala, GPT4All, and Wizard)
- MPT
See getting models for more information on how to download supported models.
This project depends on Rust v1.65.0 or above and a modern C toolchain.
The llm crate exports llm-base and the model crates (e.g. bloom, gpt2, llama).
Add llm to your project by listing it as a dependency in Cargo.toml. To use
the version of llm you see in the main branch of this repository, add it
from GitHub (although keep in mind this is pre-release software):
[dependencies]
llm = { git = "https://github.com/rustformers/llm" , branch = "main" }
To use a released version, add it from crates.io by specifying the desired version:
[dependencies]
llm = "0.1"
By default, llm builds with support for remotely fetching the tokenizer from Hugging Face's model hub.
To disable this, disable the default features for the crate, and turn on the models feature to get llm without the tokenizer:
[dependencies]
llm = { version = "0.1", default-features = false, features = ["models"] }
NOTE: To improve debug performance, build the transitive ggml-sys dependency with optimizations even in debug mode:
[profile.dev.package.ggml-sys]
opt-level = 3
The llm library is engineered to take advantage of hardware accelerators such as cuda and metal for optimized performance.
To enable llm to harness these accelerators, some preliminary configuration steps are necessary, which vary based on your operating system. For comprehensive guidance, please refer to Acceleration Support in our documentation.
Bindings for this library are available in the following languages:
- Python: LLukas22/llm-rs-python
- Node: Atome-FE/llama-node
The easiest way to get started with llm-cli is to download a pre-built
executable from a released version of llm, but the releases are currently
out of date and we recommend you install from source instead.
To install the main branch of llm with the most recent features to your
Cargo bin directory, which rustup is likely to have added to your PATH, run:
cargo install --git https://github.com/rustformers/llm llm-cli
The CLI application can then be run through llm. See also features and
acceleration support to turn features on as required.
Note that GPU support (CUDA, OpenCL, Metal) will not work unless you build with the relevant feature.
Note that the currently published version is out of date and does not include support for the most recent models. We currently recommend that you install from source.
To install the most recently released version of llm to your Cargo bin
directory, which rustup is likely to have added to your PATH, run:
cargo install llm-cli
The CLI application can then be run through llm. See also features to turn
features on as required.
By default, llm builds with support for remotely fetching the tokenizer from Hugging Face's model hub.
This adds a dependency on your system's native SSL stack, which may not be available on all systems.
To disable this, disable the default features for the build:
cargo build --release --no-default-features
To enable hardware acceleration, see the Acceleration Support for Building section, which also applies to the CLI.
GGML models are easy to acquire. They are primarily located on Hugging Face (see From Hugging Face), but can be obtained from elsewhere.
Models are distributed as single files, and do not need any additional files to be downloaded. However, they are quantized with different levels of precision, so you will need to choose a quantization level that is appropriate for your application.
Additionally, we support Hugging Face tokenizers to improve the quality of
tokenization. These are separate files (tokenizer.json) that can be used
with the CLI using the -v or -r flags, or with the llm crate by using the
appropriate TokenizerSource enum variant.
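As a rough illustration of how the two flags could map onto a tokenizer choice, here is a self-contained sketch. The `TokenizerChoice` enum and `choose_tokenizer` function are hypothetical stand-ins written for this example; they are not the crate's `TokenizerSource`, whose actual variants should be checked on Docs.rs. Rejecting both flags at once is also an assumption of the sketch.

```rust
use std::path::PathBuf;

/// Hypothetical stand-in for a tokenizer source; NOT the llm crate's type.
#[derive(Debug, PartialEq)]
enum TokenizerChoice {
    /// Use the vocabulary embedded in the model file.
    Embedded,
    /// Use a local tokenizer.json (what `-v` points at).
    LocalFile(PathBuf),
    /// Fetch the tokenizer from a Hugging Face repository (what `-r` names).
    Remote(String),
}

/// Map the optional `-v` (local file) and `-r` (remote repository) values to
/// a tokenizer choice. Supplying both is treated as an error in this sketch.
fn choose_tokenizer(local: Option<&str>, remote: Option<&str>) -> Result<TokenizerChoice, String> {
    match (local, remote) {
        (Some(_), Some(_)) => {
            Err("specify either a local tokenizer file or a remote repository, not both".to_string())
        }
        (Some(path), None) => Ok(TokenizerChoice::LocalFile(PathBuf::from(path))),
        (None, Some(repo)) => Ok(TokenizerChoice::Remote(repo.to_string())),
        (None, None) => Ok(TokenizerChoice::Embedded),
    }
}

fn main() {
    println!("{:?}", choose_tokenizer(None, Some("togethercomputer/RedPajama-INCITE-Base-3B-v1")));
    println!("{:?}", choose_tokenizer(Some("tokenizer.json"), None));
    println!("{:?}", choose_tokenizer(None, None));
}
```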
For a list of models that have been tested, see the known-good models.
Certain older GGML formats are not supported by this project, but the goal is to maintain feature parity with the upstream GGML project. For problems relating to loading models, or to request support for GGML model types that are not yet supported, please open an Issue.
Hugging Face 🤗 is a leader in open-source machine learning and hosts hundreds of GGML models. Search for GGML models on Hugging Face 🤗.
This Reddit community maintains a wiki related to GGML models, including well-organized lists of links for acquiring GGML models (mostly from Hugging Face 🤗).
Once the llm executable has been built or is in a $PATH directory, try
running it. Here's an example that uses the open-source RedPajama language
model:
llm infer -a gptneox -m RedPajama-INCITE-Base-3B-v1-q4_0.bin -p "Rust is a cool programming language because" -r togethercomputer/RedPajama-INCITE-Base-3B-v1
In the example above, the first two arguments specify the command (infer) and
the model architecture (-a gptneox), respectively. The required -m argument
specifies the local path to the model, and the required -p argument specifies
the evaluation prompt. The optional -r argument is used to load the model's
tokenizer from a remote Hugging Face 🤗 repository, which will typically
improve results when compared to loading the tokenizer from the model file
itself; there is also an optional -v argument that can be used to specify the
path to a local tokenizer file.
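If you want to drive the CLI from another Rust program, the same invocation can be assembled with `std::process::Command`. This is a sketch that only builds the command and prints it; it assumes an llm executable is on your PATH and uses only the flags described above. The helper name `infer_command` is made up for this example.

```rust
use std::process::Command;

/// Build (but do not run) an `llm infer` invocation using the flags
/// described above. `tokenizer_repo` maps to the optional `-r` flag.
fn infer_command(arch: &str, model: &str, prompt: &str, tokenizer_repo: Option<&str>) -> Command {
    let mut cmd = Command::new("llm");
    cmd.arg("infer")
        .args(["-a", arch])
        .args(["-m", model])
        .args(["-p", prompt]);
    if let Some(repo) = tokenizer_repo {
        cmd.args(["-r", repo]);
    }
    cmd
}

fn main() {
    let cmd = infer_command(
        "gptneox",
        "RedPajama-INCITE-Base-3B-v1-q4_0.bin",
        "Rust is a cool programming language because",
        Some("togethercomputer/RedPajama-INCITE-Base-3B-v1"),
    );
    // Print the command for inspection. To actually run it, make `cmd`
    // mutable and call `cmd.status()`; that requires the executable and
    // the model file to exist.
    println!("{:?}", cmd);
}
```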
For more information about the llm CLI, use the --help parameter.
There is also a simple inference example that is helpful for debugging:
cargo run --release --example inference gptneox RedPajama-INCITE-Base-3B-v1-q4_0.bin -r $OPTIONAL_VOCAB_REPO -p $OPTIONAL_PROMPT
The CLI can be used for chat, but certain fine-tuned models (e.g. Alpaca,
Vicuna, Pygmalion) are more suited to chat use-cases than so-called "base
models". Here's an example of using the llm CLI in REPL
(Read-Evaluate-Print Loop) mode with an Alpaca model. Note that the provided
prompt format is tailored to the model that is being used:
llm repl -a llama -m ggml-alpaca-7b-q4.bin -f utils/prompts/alpaca.txt
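The -f flag above points at a prompt template file. To show why the format matters, the sketch below wraps user input in an Alpaca-style instruction template before it would be sent to the model. The `{{PROMPT}}` placeholder, the template text, and the `render_prompt` helper are all assumptions made for illustration; consult utils/prompts/alpaca.txt for the format the CLI actually expects.

```rust
/// Placeholder assumed for this sketch; check the real template file.
const PLACEHOLDER: &str = "{{PROMPT}}";

/// Substitute the user's input into a prompt template.
/// Returns `None` if the template does not contain the placeholder.
fn render_prompt(template: &str, user_input: &str) -> Option<String> {
    if !template.contains(PLACEHOLDER) {
        return None;
    }
    Some(template.replace(PLACEHOLDER, user_input))
}

fn main() {
    // Illustrative Alpaca-style template, not copied from the repository.
    let template = "### Instruction:\n\n{{PROMPT}}\n\n### Response:\n\n";
    match render_prompt(template, "Explain what a REPL is.") {
        Some(prompt) => print!("{}", prompt),
        None => eprintln!("template is missing the {} placeholder", PLACEHOLDER),
    }
}
```

A model fine-tuned on one template tends to respond poorly when prompted with another, which is why the template is chosen per model.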
There is also a Vicuna chat example that demonstrates how to create a custom chatbot:
cargo run --release --example vicuna-chat llama ggml-vicuna-7b-q4.bin
Sessions can be loaded (--load-session) or saved (--save-session) to file.
To automatically load and save the same session, use --persist-session. This
can be used to cache prompts to reduce load time, too.
llm can produce a q4_0- or q4_1-quantized model from an f16 (half-precision) GGML model:
cargo run --release quantize -a $MODEL_ARCHITECTURE $MODEL_IN $MODEL_OUT {q4_0,q4_1}
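To give a feel for what 4-bit quantization does, here is a simplified, self-contained sketch of block-wise symmetric quantization: each block of 32 values shares one scale, and each value is stored as a 4-bit level. This is an illustration of the general idea under stated assumptions, not GGML's exact q4_0 or q4_1 layout; the names `Block`, `quantize_block`, and `dequantize_block` are invented for the example, and levels are kept one per byte rather than packed two per byte.

```rust
const BLOCK: usize = 32;

/// One quantized block: a shared scale plus one 4-bit level per value.
/// Levels are stored unpacked (one per byte) to keep the sketch readable.
struct Block {
    scale: f32,
    quants: [u8; BLOCK],
}

fn quantize_block(values: &[f32; BLOCK]) -> Block {
    // Choose the scale so the largest magnitude maps to level 7.
    let max_abs = values.iter().fold(0.0f32, |m, v| m.max(v.abs()));
    let scale = max_abs / 7.0;
    let inv = if scale > 0.0 { 1.0 / scale } else { 0.0 };
    let mut quants = [0u8; BLOCK];
    for (q, v) in quants.iter_mut().zip(values) {
        // Round to a signed level in -7..=7, then shift by 8 to store 1..=15.
        let level = (v * inv).round().clamp(-7.0, 7.0) as i8;
        *q = (level + 8) as u8;
    }
    Block { scale, quants }
}

fn dequantize_block(block: &Block) -> [f32; BLOCK] {
    let mut out = [0.0f32; BLOCK];
    for (o, q) in out.iter_mut().zip(&block.quants) {
        // Undo the shift, then multiply by the shared scale.
        *o = (*q as i8 - 8) as f32 * block.scale;
    }
    out
}

fn main() {
    let mut values = [0.0f32; BLOCK];
    for (i, v) in values.iter_mut().enumerate() {
        *v = i as f32 * 0.1 - 1.6;
    }
    let block = quantize_block(&values);
    let restored = dequantize_block(&block);
    let max_err = values
        .iter()
        .zip(restored.iter())
        .map(|(a, b)| (a - b).abs())
        .fold(0.0f32, f32::max);
    println!("scale = {}, max round-trip error = {}", block.scale, max_err);
}
```

The round-trip error per value is bounded by roughly half the block's scale, which is the precision given up in exchange for a much smaller file.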
The llm Dockerfile is in the utils directory; the NixOS flake manifest and
lockfile are in the project root.
GitHub Issues and Discussions are welcome, or come chat on Discord!
Contributions are welcome! Please see the contributing guide.
- llmcord: Discord bot for generating messages using llm.
- local.ai: Desktop app for hosting an inference API on your local machine using llm.
- secondbrain: Desktop app to download and run LLMs locally on your computer using llm.
- floneum: A graph editor for local AI workflows.
- poly: A versatile LLM serving back-end with tasks, streaming completion, memory retrieval, and more.
- llm-chain: Build chains in large language models for text summarization and completion of more complex tasks