This repository contains a collection of examples for optimized deployment of popular Large Language Models (LLMs) using SageMaker Inference. Hosting LLMs comes with a variety of challenges due to model size, inefficient hardware utilization, and the need to scale LLMs to a production-like environment with many concurrent users.
SageMaker Inference is a highly performant and versatile hosting platform that offers several ways to efficiently host your LLMs. In this repository we showcase how you can take different SageMaker Inference options, such as Real-Time Inference (low-latency, high-throughput use cases) and Asynchronous Inference (near real-time/batch use cases), and integrate them with model servers such as DJL Serving and Text Generation Inference (TGI). We also show how to tune these model-serving stacks for performance and explore hardware options such as Inferentia2 integration with Amazon SageMaker.
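As a quick illustration, below is a minimal sketch of deploying an open LLM to a SageMaker Real-Time Inference endpoint using the Hugging Face TGI container via the SageMaker Python SDK. The model ID, instance type, and serving parameters are illustrative assumptions; the examples linked below cover these choices, along with DJL Serving, Asynchronous Inference, and Inferentia2, in more depth.

```python
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

# Assumes this runs inside SageMaker (Studio/notebook) with an execution role attached;
# otherwise pass an IAM role ARN with SageMaker permissions explicitly.
role = sagemaker.get_execution_role()

# Retrieve the Hugging Face TGI (Text Generation Inference) container image URI.
# Omitting `version` picks the latest TGI version supported by the SDK.
image_uri = get_huggingface_llm_image_uri("huggingface")

# Serving configuration passed to the container as environment variables.
# Model ID and token limits are example values; adjust for your model and workload.
env = {
    "HF_MODEL_ID": "mistralai/Mistral-7B-Instruct-v0.2",
    "SM_NUM_GPUS": "1",           # tensor-parallel degree (GPUs on the instance)
    "MAX_INPUT_LENGTH": "2048",   # max prompt tokens
    "MAX_TOTAL_TOKENS": "4096",   # max prompt + generated tokens
}

model = HuggingFaceModel(role=role, image_uri=image_uri, env=env)

# Deploy to a Real-Time Inference endpoint; the instance type is an assumption,
# size it to fit your model's memory footprint.
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    container_startup_health_check_timeout=600,
)

# Invoke the endpoint with a sample prompt.
response = predictor.predict({
    "inputs": "What is Amazon SageMaker?",
    "parameters": {"max_new_tokens": 128, "temperature": 0.7},
})
print(response)
```

The same `HuggingFaceModel` can be attached to an Asynchronous Inference configuration instead of a real-time endpoint when requests are long-running or bursty; see the linked examples for those variations.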
If you are contributing, please add a link to your example below:
- Introduction to Large Model Inference Container
- LLM Inference Optimization Toolkit
- Large Model Inference Container Tuning Guide
- Text Generation Inference with Amazon SageMaker
- Server Side Batching Optimizations with LMI
- General SageMaker Hosting Examples Repo
- SageMaker Hosting Blog Series
- Easily deploy and manage hundreds of LoRA Adapters
See CONTRIBUTING for more information.
This library is licensed under the MIT-0 License. See the LICENSE file.