S-LoRA: Serving Thousands of Models From One GPU for Fun and Profit - OpenPipe #636
Related issues

#505: LoRAX: Dynamic loading and optimized inference of LoRA adapter models (similarity score: 0.92)
- [LoRAX Docs](https://predibase.github.io/lorax/?h=cpu#features)
- LoRAX (LoRA eXchange) is a multi-LoRA inference server that scales to thousands of fine-tuned LLMs. It lets users serve thousands of fine-tuned models on a single GPU, dramatically reducing the cost of serving without compromising on throughput or latency.
- Suggested labels: { "label-name": "LoRA Framework", "description": "A powerful framework for serving fine-tuned models on a single GPU efficiently.", "repo": "llm-inference-engines", "confidence": 98.7 }

#408: llama.cpp/examples/llama-bench/README.md at master · ggerganov/llama.cpp (similarity score: 0.85)
- [llama.cpp/examples/llama-bench/README.md](https://github.com/ggerganov/llama.cpp/blob/master/examples/llama-bench/README.md)
- llama-bench is a performance testing tool for llama.cpp. It lets you measure the library's performance with different models, prompt-processing batch sizes, thread counts, numbers of layers offloaded to the GPU, and output formats.
- Multiple values can be given for each parameter, and the tool benchmarks each combination of the models, batch sizes, thread counts, and GPU-offloaded layer counts you specify; results can be emitted in several output formats.

#457: I keep running out of memory. What's the biggest model, and most context, I can run on a 3060 12GB with decent speed? : r/LocalLLaMA (similarity score: 0.85)
- [Reddit thread](https://www.reddit.com/r/LocalLLaMA/comments/1abihou/i_keep_running_out_of_memory_whats_the_biggest/)

# GPU and Model Recommendations
**GPU Only:**
- You can use 7B models at 8 bpw with 8K context, or maybe up to 12K context.
- If you wish to use 13B models, then you have to use 4 bpw and limit yourself to 2K context.
**GPU + CPU:**
- Use `.gguf` files to offload part of the model to VRAM.
- Check the disk usage when inferencing in the activity monitor app (or whatever it is called in your OS). If the disk usage is 100% (disk is swapping), then it is impossible to fit the model in RAM + VRAM and tokens per second will be very low.
- In that case, reduce context size and reduce bpw. The best models you can probably run now are:
- OpenChat 3.5 7B at 8bpw (Use the latest version)
- <https://huggingface.co/vicgalle/solarized-18B-dpo-GGUF> at 4bpw and 4K context.
- If you want to run Nous-Capybara-34b, switch to the 3bpw version and try to offload 35 layers to GPU. If you want to run bigger models, upgrade RAM to 64GB.
**Tip from /u/Working-Flatworm-531:**
- Just do not load the KV cache in VRAM; you can use `ooba` to disable it.
- Also try lower quants, for example Q4_K_S is still good. You still wouldn't be able to run 34B models with good speed, but at least it's something.
- You can also check your BIOS and maybe increase RAM frequency. After that, you'd be able to run ~20B models at ~2t/s at 8k ~ 12k context.
**Recommended List from /u/Working-Flatworm-531:**
- Use Linux.
- Overclock RAM (if possible).
- Overclock CPU (if possible).
- Overclock GPU.
- Don't load the KV cache in VRAM; instead, load more layers into VRAM.
- Use smaller quants.
- Use fast interface (didn't try Kobold, use Ooba).
- Check RAM (should be dual channel).
- Suggested labels: { "label-name": "memory-optimization", "description": "Strategies for optimizing memory usage when running models on 3060 12gb GPU.", "confidence": 94.88 }

#153: Mastering LLM Techniques: Inference Optimization | NVIDIA Technical Blog (similarity score: 0.85)
- [Mastering LLM Techniques: Inference Optimization | NVIDIA Technical Blog](https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/)
- This post discusses the most pressing challenges in LLM inference, along with some practical solutions. Readers should have a basic understanding of the transformer architecture and the attention mechanism in general; the post itself addresses the intricacies of LLM inference.

#628: LLaVA/README.md at main · haotian-liu/LLaVA (similarity score: 0.84)
- [LLaVA/README.md at main · haotian-liu/LLaVA](https://github.com/haotian-liu/LLaVA/blob/main/README.md?plain=1)
- 🌋 LLaVA (Large Language and Vision Assistant): visual instruction tuning towards large language and vision models with GPT-4 level capabilities. Related releases: LLaVA-NeXT, Improved Baselines with Visual Instruction Tuning, and Visual Instruction Tuning (NeurIPS 2023, Oral).
- Usage and License Notices: This project utilizes certain datasets and checkpoints that are subject to their respective original licenses. Users must comply with all terms and conditions of these original licenses, including but not limited to the OpenAI Terms of Use for the dataset and the specific licenses for base language models for checkpoints trained using the dataset (e.g. the Llama community license for LLaMA-2 and Vicuna-v1.5). This project does not impose any additional constraints beyond those stipulated in the original licenses. Furthermore, users are reminded to ensure that their use of the dataset and checkpoints complies with all applicable laws and regulations.

#332: streaming-llm: Efficient Streaming Language Models with Attention Sinks (similarity score: 0.84)
- [mit-han-lab/streaming-llm: Efficient Streaming Language Models with Attention Sinks](https://github.com/mit-han-lab/streaming-llm)

> **TL;DR:** We deploy LLMs for infinite-length inputs without sacrificing efficiency and performance.
>
> **News**
>
> - [2023/10] StreamingLLM is integrated into Intel Extension for Transformers.
> - [2023/10] Check out Attention Sinks, a third-party implementation that enables StreamingLLM on more Hugging Face LLMs.
>
> **Abstract**
>
> Deploying Large Language Models (LLMs) in streaming applications such as multi-round dialogue, where long interactions are expected, is urgently needed but poses two major challenges. Firstly, during the decoding stage, caching previous tokens' Key and Value states (KV) consumes extensive memory. Secondly, popular LLMs cannot generalize to longer texts than the training sequence length. Window attention, where only the most recent KVs are cached, is a natural approach, but we show that it fails when the text length surpasses the cache size. We observe an interesting phenomenon, namely attention sink: keeping the KV of initial tokens largely recovers the performance of window attention. In this paper, we first demonstrate that the emergence of the attention sink is due to the strong attention scores towards initial tokens as a "sink", even if they are not semantically important. Based on this analysis, we introduce StreamingLLM, an efficient framework that enables LLMs trained with a finite-length attention window to generalize to infinite sequence length without any fine-tuning. We show that StreamingLLM can enable Llama-2, MPT, Falcon, and Pythia to perform stable and efficient language modeling with up to 4 million tokens and more. In addition, we discover that adding a placeholder token as a dedicated attention sink during pre-training can further improve streaming deployment. In streaming settings, StreamingLLM outperforms the sliding-window recomputation baseline by up to a 22.2x speedup.
>
> **Usage**
>
> Environment setup:
>
> ```
> conda create -yn streaming python=3.8
> conda activate streaming
>
> pip install torch torchvision torchaudio
> pip install transformers==4.33.0 accelerate datasets evaluate wandb scikit-learn scipy sentencepiece
>
> python setup.py develop
> ```
>
> Run the streaming Llama chatbot:
>
> ```
> CUDA_VISIBLE_DEVICES=0 python examples/run_streaming_llama.py --enable_streaming
> ```
>
> **FAQ**
>
> **What does "working on infinite-length inputs" imply for LLMs?**
>
> Handling infinite-length text with LLMs presents challenges. Notably, storing all previous Key and Value (KV) states demands significant memory, and models might struggle to generate text beyond their training sequence length. StreamingLLM addresses this by retaining only the most recent tokens and the attention sinks, discarding intermediate tokens. This enables the model to generate coherent text from recent tokens without a cache reset, a capability not seen in earlier methods.
>
> **Is the context window of LLMs expanded?**
>
> No. The context window remains unchanged; only the most recent tokens and attention sinks are retained, and middle tokens are discarded. This means the model can only process the latest tokens, and the context window remains constrained by its initial pre-training. For instance, if Llama-2 is pre-trained with a context window of 4096 tokens, then the maximum cache size for StreamingLLM on Llama-2 remains 4096.
>
> **Can I input an extensive text, like a book, into StreamingLLM for summarization?**
>
> While you can input a lengthy text, the model will only recognize the latest tokens.
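To make the eviction policy described in that FAQ concrete, here is a minimal Python sketch: always keep a few initial "attention sink" tokens plus a rolling window of the most recent tokens, and drop everything in between. The sink count and window size below are illustrative assumptions, not the repo's actual defaults or implementation.

```python
from collections import deque

class SinkKVCache:
    """Toy KV-cache policy: keep the first few (sink) tokens and a rolling
    window of recent tokens; middle tokens are evicted."""
    def __init__(self, num_sinks=4, window=2044):
        self.num_sinks = num_sinks                  # illustrative values only
        self.sinks = []                             # KV entries for the initial tokens
        self.recent = deque(maxlen=window)          # rolling window of recent KV entries

    def append(self, kv_entry):
        if len(self.sinks) < self.num_sinks:
            self.sinks.append(kv_entry)
        else:
            self.recent.append(kv_entry)            # oldest middle token falls out automatically

    def visible(self):
        # KV states attended to at the next decoding step: sinks + recent window
        return self.sinks + list(self.recent)
```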
S-LoRA: Serving Thousands of Models From One GPU for Fun and Profit - OpenPipe
DESCRIPTION:
S-LoRA describes a set of optimizations for running thousands of separate LLMs simultaneously on the same GPU. At OpenPipe we’ve been running S-LoRA in production since January 4th, which critically allowed us to eliminate the cold-start problem for infrequently-used models. I wanted to share some of our learnings from the implementation process here!
But first, the chart in the original post shows the average cold-start response time we're seeing after enabling the S-LoRA based pipeline.
The Problem of Weights
Modern LLMs require a lot of GPU RAM. A “small” model like Mistral 7B requires 14GB of RAM just to hold the weights, in addition to the working memory required for the KV cache, which can be multiple GB for long sequences. This means that even a very beefy GPU like an A100-40GB only has room to load one or maybe two 7B LLMs in RAM at once. Quantization can reduce the required memory, but it also leads to decreased throughput, and often hurts response quality as well.
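A rough back-of-envelope calculation shows where the 14GB figure comes from and how the KV cache adds to it. The layer and head dimensions below are assumptions for a Mistral-7B-like architecture, not measured values.

```python
def weights_gb(n_params=7e9, bytes_per_param=2):
    """Memory for the weights alone, assuming fp16/bf16 (2 bytes per parameter)."""
    return n_params * bytes_per_param / 1e9            # ~14 GB for a 7B model

def kv_cache_gb(n_layers=32, n_kv_heads=8, head_dim=128,
                seq_len=8192, batch=1, bytes_per_elem=2):
    """KV cache size: 2 (keys and values) x layers x kv_heads x head_dim x tokens."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem / 1e9

print(f"weights  ≈ {weights_gb():.1f} GB")              # 14.0
print(f"KV cache ≈ {kv_cache_gb():.1f} GB")             # ~1 GB per 8K-token sequence; grows with batch size
```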
This is not really a problem if you’re using one general-purpose model for everything, and just steering its behavior via prompting. In that case you can just load up your model on one GPU and call it a day. But fine-tuning is a far more reliable way of directing model behavior than prompting. Concretely, we’ve found that 7B models fine-tuned on a good dataset consistently outperform prompted GPT-3.5 (20B parameters), and even come within striking distance of GPT-4 (1.7T parameters)!
The downside, of course, is that now you have to figure out how to serve all those task-specific fine-tuned models efficiently. Spinning up a dedicated GPU for each model is a non-starter because it leads to low GPU utilization, which is an existential issue because of how expensive GPU time is ($2+/hr for an A100). How do we square the circle?
Serving all the models everywhere all at once
First, a bit of background: in 2021 a new fine-tuning method called LoRA was published. The key insight is that fine-tuning only a tiny fraction of the base model’s weights can give you similar results to fine-tuning all of them, since you want your fine-tuned model to keep most of the world understanding and reasoning ability of its base. The LoRA technique involves cleverly inserting extra adapter layers in a few carefully-selected locations and only fine-tuning those. These adapters are analogous to a “git diff” that encodes only the difference in weights between the base model and your fine-tune.
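For readers unfamiliar with the mechanics, here is a minimal sketch of what a LoRA adapter does to a single linear layer. The rank and scaling values are illustrative, and this is not OpenPipe's serving code.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen base linear layer with a trainable low-rank update:
    y = W x + (alpha / r) * B @ A @ x, where A is (r x d_in) and B is (d_out x r)."""
    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base
        self.base.requires_grad_(False)                  # base weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero-init: starts as a no-op
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```

Only A and B are trained and later shipped as the adapter, which is what makes the "git diff" analogy apt.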
These adapters can be tiny. In OpenPipe’s case, our Mistral adapters are 80MB each, only 0.5% the size of the 14GB base model. This immediately points to the shape of the solution: is it possible to load many adapters from the same base model onto one GPU and use them simultaneously, efficiently?
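As a quick sanity check on that 80MB figure, a rank-16 adapter over the attention and MLP projections of a Mistral-7B-sized model lands in the same ballpark. The target modules and rank below are assumptions for illustration, not OpenPipe's actual adapter configuration.

```python
# Parameter count for a rank-r LoRA on a (d_out x d_in) weight matrix is r * (d_in + d_out).
r, n_layers = 16, 32
targets = {                          # hypothetical per-layer target projections: (d_out, d_in)
    "q_proj": (4096, 4096), "k_proj": (1024, 4096), "v_proj": (1024, 4096), "o_proj": (4096, 4096),
    "gate_proj": (14336, 4096), "up_proj": (14336, 4096), "down_proj": (4096, 14336),
}
params = n_layers * sum(r * (d_out + d_in) for d_out, d_in in targets.values())
print(f"{params/1e6:.0f}M params ≈ {params * 2 / 1e6:.0f} MB in fp16")   # ~42M params ≈ ~84 MB
```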
It turns out the answer is “yes”! Two influential papers from late 2023 help define the solution.
Punica implements a clever CUDA kernel that is able to batch-process requests from many LoRA adapters simultaneously. This custom kernel is essential, because the naive approach taken by most libraries pre-Punica required swapping adapters for each request, eliminating the critical throughput increases from serving many requests in parallel.
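The essence of the batched approach, written in plain PyTorch for clarity: each request in a mixed batch is routed to its own adapter's matrices. The real speedup comes from Punica's fused CUDA kernel, which this sketch does not reproduce; it only illustrates the data flow.

```python
import torch

def batched_lora_delta(x, A_stack, B_stack, adapter_idx, scale=1.0):
    """
    x:           (batch, d_in)          activations for a batch mixing many adapters
    A_stack:     (n_adapters, r, d_in)  A matrices of all currently loaded adapters
    B_stack:     (n_adapters, d_out, r) B matrices of all currently loaded adapters
    adapter_idx: (batch,)               which adapter each request uses
    Returns the per-request LoRA correction to add to the base model's output.
    """
    A = A_stack[adapter_idx]                       # (batch, r, d_in): gather each request's adapter
    B = B_stack[adapter_idx]                       # (batch, d_out, r)
    h = torch.bmm(A, x.unsqueeze(-1))              # (batch, r, 1)
    return scale * torch.bmm(B, h).squeeze(-1)     # (batch, d_out)
```

The key property is that one forward pass serves requests for many different fine-tunes, so batching, and therefore throughput, is preserved.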
S-LoRA builds on Punica and adds a tiered caching architecture. It dynamically stores the most-recently-used adapters in GPU RAM, less-recently-used adapters in system RAM, and the least-recently-used adapters on disk. For a typical setup with 10GB of available GPU RAM and 1TB of system RAM, S-LoRA might store 125 adapters on the GPU and over 10K in system RAM. The overhead of restoring an adapter from system RAM to the GPU is negligible in practice; an A100 has 31GB/s of interconnect bandwidth so an 80MB adapter can be transferred in 2.4ms. This can happen in parallel with serving other requests.
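A minimal sketch of the tiered-cache idea, assuming an LRU policy and the illustrative capacities mentioned above; this is not the S-LoRA codebase, and the transfer helpers are left as parameters rather than tied to any framework.

```python
from collections import OrderedDict

class TieredAdapterCache:
    """Keep hot adapters on the GPU, warm ones in host RAM, and everything else on disk."""
    def __init__(self, gpu_slots=125, cpu_slots=10_000):
        self.gpu = OrderedDict()      # adapter_id -> weights resident on the GPU (LRU order)
        self.cpu = OrderedDict()      # adapter_id -> weights held in host RAM
        self.gpu_slots, self.cpu_slots = gpu_slots, cpu_slots

    def get(self, adapter_id, load_from_disk, to_gpu, to_cpu):
        if adapter_id in self.gpu:                           # hot path: already resident
            self.gpu.move_to_end(adapter_id)
            return self.gpu[adapter_id]
        weights = self.cpu.pop(adapter_id, None)             # warm path: host RAM -> GPU
        if weights is None:
            weights = load_from_disk(adapter_id)             # cold path: disk -> GPU
        if len(self.gpu) >= self.gpu_slots:                  # demote the least-recently-used adapter
            old_id, old_w = self.gpu.popitem(last=False)
            self.cpu[old_id] = to_cpu(old_w)
            if len(self.cpu) > self.cpu_slots:
                self.cpu.popitem(last=False)                 # drop to disk-only
        self.gpu[adapter_id] = to_gpu(weights)               # ~80MB over PCIe is a few ms
        return self.gpu[adapter_id]
```

With PyTorch tensors, one might pass `to_gpu=lambda w: w.cuda(non_blocking=True)` and `to_cpu=lambda w: w.cpu()`, overlapping the copy with serving other requests.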
This actually works!
On January 4th we deployed an experimental inference pipeline based on a vLLM fork that implements the relevant optimizations. After manually moving a few models over and closely monitoring performance, we enabled the pipeline for all new models on January 10th, and began porting over old models as well.
Over the course of this transition, the average number of GPUs in use has dropped by over 70%, even as the number of requests we serve has continued increasing! Our average response time for models coming up from a cold start (i.e. weights not already loaded onto a GPU) decreased from 45 seconds to 1 second, giving customers a lot more flexibility to deploy many small specialist models. And ultimately, that's exactly what we're here to do. 🙂
URL: https://openpipe.ai/blog/s-lora
Suggested labels
{'label-name': 'GPU-Optimization', 'label-description': 'Optimizing GPU resource utilization for running multiple models efficiently on a single GPU.', 'gh-repo': 'openpipe/openpipe-ai', 'confidence': 54.2}