Highlights in this release
This release features performance improvements for supported models.
Upcoming features
- Development is well underway on sharded, "tensor parallel" serving of Large Language Models (LLMs). See our Llama serving user guide for the latest set of supported models and up-to-date serving instructions.
- Support for high-performance serving across a wider range of model architectures is in progress. See the sharktank/models/ directory for the latest updates.
Changelog
Full list of changes: v3.1.0...v3.2.0