Exllamav2 production prospect #355
-
In my own testing I find exllamav2 extremely fast compared to other libraries, but I have heard that it is not designed for production use(?). I'd like to clarify a few things, as I want to adopt the library to power my inference endpoint. Are exllamav2 and the EXL2 format suitable for production? If yes, what guidelines should I follow to ensure effective inference? If not, why is that, and is there any way to mitigate it?
-
It depends what you mean by production. If you mean running on a large inference server with many concurrent users, then no, it's not all too well suited for that. I would consider paged attention an essential feature, for instance (for efficient continuous batching). That may be coming soon, but this is all still largely a solo project and I only have so much time to dedicate to each feature.

What's more, as you go up in batch size, the benefits of quantization start to matter less and less. The amount of VRAM required for context scales with the number of concurrent users you want to support, while the weights stay the same size. So when you eventually have to reserve 200 GB of VRAM for caches, the difference between a 20 GB EXL2 model and a maybe slightly larger AWQ model won't really matter.

If you plan on starting small and maybe scaling up later, you'll definitely have an easier time down the line with something like vLLM or TGI. Though licensing could be a concern too, I suppose.

You could consider building around tabbyAPI. It hosts ExLlamaV2 with an OAI-compatible API and gives you all of the immediate benefits, but with a layer of abstraction so you can easily switch out the backend if it starts to become a bottleneck.
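To put rough numbers on the cache-scaling point, here is a back-of-the-envelope sketch. The model dimensions, per-user context length, and the 20 GB weight size are assumed for illustration (roughly a 70B-class GQA model), not measurements of any particular deployment; the formula is just the standard FP16 K/V cache size per token.

```python
# Back-of-the-envelope KV-cache sizing. All model dimensions, the per-user
# context length, and the weight size below are assumed for illustration;
# they are not tied to any specific model or deployment.

def kv_cache_gb(num_layers, num_kv_heads, head_dim, ctx_len, users, bytes_per_elem=2):
    """K/V cache size in GiB: 2 tensors (K and V) per layer, per token, per user."""
    total_bytes = 2 * num_layers * num_kv_heads * head_dim * ctx_len * users * bytes_per_elem
    return total_bytes / 1024**3

weights_gb = 20  # e.g. a ~20 GB EXL2 quant (assumed)

for users in (1, 8, 64, 256):
    cache = kv_cache_gb(num_layers=80, num_kv_heads=8, head_dim=128,
                        ctx_len=8192, users=users)
    print(f"{users:>4} users: cache ~ {cache:6.1f} GB, weights = {weights_gb} GB")
```

With these assumed numbers the cache is about 2.5 GB per user, so at a handful of users the quantized weights dominate VRAM, while at tens or hundreds of users the cache does. That is why the EXL2-vs-AWQ size difference stops mattering much at scale.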
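As for the tabbyAPI route, the abstraction point is that your application only ever talks to an OpenAI-compatible endpoint, so swapping the backend later is mostly a client-side config change. Below is a minimal sketch using the official openai Python client; the base URL, API key, and model name are placeholders for whatever your tabbyAPI instance is actually configured with.

```python
# Minimal sketch of a client talking to an OpenAI-compatible endpoint
# (tabbyAPI here, but vLLM and TGI expose the same kind of interface).
# The URL, API key, and model name are placeholders for your own deployment.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:5000/v1",  # port and path assumed for a local tabbyAPI instance
    api_key="your-tabby-api-key",
)

response = client.chat.completions.create(
    model="my-exl2-model",  # whatever model the server has loaded
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```

Since vLLM and TGI also offer OpenAI-compatible endpoints, moving off ExLlamaV2 later would mainly mean pointing `base_url` (and the model name) at the new server rather than rewriting the application.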