This repository has been archived by the owner on Jun 24, 2024. It is now read-only.

Paged Attention #333

Open
vikigenius opened this issue Jun 26, 2023 · 10 comments
Labels
issue:enhancement (New feature or request), topic:backend-support (Support for alternate non-GGML backends, or for particular GGML backend features)

Comments

@vikigenius

I just found a recent blog post (https://vllm.ai/) and repo (https://github.com/vllm-project/vllm) that implement paged attention. I tested it out, and it provides massive throughput and memory-efficiency improvements.

Can we implement something like this? The paper isn't out yet, but shouldn't Rust be very good at this in theory, with its memory safety guarantees?

@Mellonta

Does it have any benefit for CPU-only inference, given that host memory is already paged?

@okpatil4u

@vikigenius could you please share your benchmarks of vLLM vs. llama.cpp on GPU? That would give us some insight into the potential speedup.

@vikigenius
Author

@okpatil4u I don't have benchmarks for llama.cpp. I primarily noticed the speedup between the PyTorch implementations with and without paged attention, and there is no reason to think an algorithmic change like that wouldn't translate across languages.

We tested it on NVIDIA A100 GPUs and got a significant speedup. I will try to get the numbers soon, once we have access to the GPUs again.

@vikigenius
Author

@okpatil4u I got the numbers now. This is not a rigorous benchmark, but it should still hold up since the gains are so significant.

With a 40 GB A100 GPU:

Inference on a vicuna-13B model without paged attention produces 20 tokens/sec
Inference on a vicuna-13B model with paged attention produces 190 tokens/sec

So the speedup is almost 10x (190 / 20 = 9.5x). Obviously this is a bit skewed: our workload uses the same initial prompt prefix in a batch inference setting, so there is likely good reuse of the KV cache, which Paged Attention makes possible.
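
To illustrate why a shared prompt prefix helps, here is a minimal sketch of a paged KV cache in which sequences in a batch point at the same physical blocks for their common prefix. This is not vLLM's actual code: the `BlockAllocator` class, `build_batch` function, block size of 16 tokens, and the prompt and batch sizes are all hypothetical, chosen only for illustration.

```python
BLOCK_SIZE = 16  # tokens per KV block (assumed value for illustration)


class BlockAllocator:
    """Hands out fixed-size physical KV blocks and tracks reference counts."""

    def __init__(self, num_blocks):
        self.num_blocks = num_blocks
        self.free = list(range(num_blocks))
        self.refcount = {}

    def allocate(self):
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def share(self, block):
        # Another sequence now references the same physical block.
        self.refcount[block] += 1
        return block

    def used(self):
        return self.num_blocks - len(self.free)


def blocks_needed(num_tokens):
    """Ceiling division: number of blocks required to hold num_tokens."""
    return -(-num_tokens // BLOCK_SIZE)


def build_batch(alloc, prefix_tokens, suffix_tokens, batch_size, share_prefix):
    """Return one block table (list of physical block ids) per sequence."""
    # Only completely filled prefix blocks are shared in this sketch.
    full_prefix_blocks = prefix_tokens // BLOCK_SIZE
    leftover = prefix_tokens - full_prefix_blocks * BLOCK_SIZE
    tables = []
    for _ in range(batch_size):
        if share_prefix and tables:
            table = [alloc.share(b) for b in tables[0][:full_prefix_blocks]]
        else:
            table = [alloc.allocate() for _ in range(full_prefix_blocks)]
        private_tokens = leftover + suffix_tokens
        table += [alloc.allocate() for _ in range(blocks_needed(private_tokens))]
        tables.append(table)
    return tables


# Hypothetical workload: 512-token shared prompt, 64 generated tokens, batch of 8.
unshared = BlockAllocator(512)
build_batch(unshared, 512, 64, 8, share_prefix=False)

shared = BlockAllocator(512)
shared_tables = build_batch(shared, 512, 64, 8, share_prefix=True)

print(unshared.used(), shared.used())  # 288 64
```

With these made-up numbers the shared layout needs 64 blocks instead of 288, which frees memory for larger batches. How much of the measured 10x comes from this effect versus other factors is not something this sketch can tell you.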

@okpatil4u

okpatil4u commented Jun 28, 2023 via email

@vikigenius
Author

Well, as I mentioned before, we don't actually use llama.cpp at work on our A100s, so my benchmark numbers compare PyTorch implementations.

It is possible that at this point llama.cpp itself is somewhat faster than the PyTorch implementation, which might explain the discrepancy.

But given how big the gain is, I would expect that if you port Paged Attention to llama.cpp you should see similar gains there as well.

@vikigenius
Author

The discussion in ggerganov/llama.cpp#1955 might be relevant, although it seems many people there are misunderstanding how the paging works.

It should be hugely beneficial for any batched inference workload, even on a single GPU.
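
As I understand the vLLM blog post, the paging here is about how KV cache memory is laid out, not about OS virtual memory: instead of reserving one contiguous region sized for the maximum sequence length, each sequence gets small fixed-size blocks on demand. A rough back-of-the-envelope sketch of the difference follows; the function names, sequence lengths, maximum length, and block size are all made up for illustration.

```python
import math


def contiguous_slots(lengths, max_len):
    """Token slots reserved when every sequence pre-allocates max_len slots."""
    return len(lengths) * max_len


def paged_slots(lengths, block_size):
    """Token slots reserved when each sequence allocates whole blocks on demand."""
    return sum(math.ceil(n / block_size) for n in lengths) * block_size


# Hypothetical batch: actual sequence lengths in tokens.
lengths = [100, 350, 1200, 60]
max_len = 2048    # assumed context limit
block_size = 16   # assumed tokens per block

used = sum(lengths)
contiguous = contiguous_slots(lengths, max_len)
paged = paged_slots(lengths, block_size)

print(f"tokens actually stored: {used}")
print(f"contiguous: {contiguous} slots, utilization {used / contiguous:.1%}")
print(f"paged:      {paged} slots, utilization {used / paged:.1%}")
```

In this toy batch the contiguous layout reserves 8192 slots for 1710 tokens, while the paged layout reserves 1728. The slots saved can hold more sequences per batch, which is where the throughput gain for batched workloads would come from.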

@philpax
Collaborator

philpax commented Jun 28, 2023

Unfortunately, we are likely beholden to what upstream GGML supports, as this would be applied at that layer of the execution. This is something we could potentially implement with #312, but even then we'd need to work with wonnx to support this.

I'll leave this issue open for now, but I don't think we'll see much movement here from our end, sorry :/

@philpax added the issue:enhancement and topic:backend-support labels on Jul 2, 2023

@AmineDiro
Contributor

Hello,
I recently saw ggerganov's PR ggerganov/llama.cpp#3228, where he implemented parallel decoding for multiple sequences. Is there any plan to support this feature?
This would basically provide a mechanism for doing batch inference 🤔
Thx

@philpax
Collaborator

philpax commented Oct 31, 2023

Hi, that would be nice to have! I'm not sure if we'll get around to it any time soon as it'll require updating our GGML version and setting up all of the required structures, but I'll see what can be done once we get there.


5 participants