Exllamav2 production prospect #355
-
In my own testing I find exllamav2 extremely fast compared to other libraries, but I have heard that it is not designed for production use(?). I'd like to clarify a few things, as I want to adopt the library to power my inference endpoint. Are exllamav2 and the EXL2 format suitable for production? If yes, what guidelines should I follow to ensure effective inference? If not, why is that, and is there any way to mitigate it?
-
It depends what you mean by production. If you mean running on a large inference server with many concurrent users, then no, it's not all too well suited for that. I would consider paged attention an essential feature, for instance (for efficient continuous batching). That may be coming soon, but this is all still largely a solo project and I only have so much time to dedicate to each feature.

What's more, as you go up in batch size, the benefits of quantization start to matter less and less. The amount of VRAM required for context scales with the number of concurrent users you want to support, while the weights stay the same size. So when you eventually have to reserve 200 GB of VRAM for caches, the difference between a 20 GB EXL2 model and a maybe slightly larger AWQ model won't really matter.

If you plan on starting small and maybe scaling up later, you'll definitely have an easier time down the line with something like vLLM or TGI. Though licensing could be a concern too, I suppose.

You could consider building around tabbyAPI. It hosts ExLlamaV2 with an OAI-compatible API and gives you all of the immediate benefits, but with a layer of abstraction so you can easily switch out the backend if it starts to become a bottleneck.
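To put rough numbers on the cache-scaling point, here is a back-of-the-envelope sketch. The model dimensions, per-user context length, and the 20 GB weight size are assumed for illustration (roughly a 70B-class GQA model), not measurements of any particular deployment; the formula is just the standard FP16 K/V cache size per token.

```python
# Back-of-the-envelope KV-cache sizing. All model dimensions, the per-user
# context length, and the weight size below are assumed for illustration;
# they are not tied to any specific model or deployment.

def kv_cache_gb(num_layers, num_kv_heads, head_dim, ctx_len, users, bytes_per_elem=2):
    """K/V cache size in GiB: 2 tensors (K and V) per layer, per token, per user."""
    total_bytes = 2 * num_layers * num_kv_heads * head_dim * ctx_len * users * bytes_per_elem
    return total_bytes / 1024**3

weights_gb = 20  # e.g. a ~20 GB EXL2 quant (assumed)

for users in (1, 8, 64, 256):
    cache = kv_cache_gb(num_layers=80, num_kv_heads=8, head_dim=128,
                        ctx_len=8192, users=users)
    print(f"{users:>4} users: cache ~ {cache:6.1f} GB, weights = {weights_gb} GB")
```

With these assumed numbers the cache is about 2.5 GB per user, so at a handful of users the quantized weights dominate VRAM, while at tens or hundreds of users the cache does. That is why the EXL2-vs-AWQ size difference stops mattering much at scale.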
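As for the tabbyAPI route, the abstraction point is that your application only ever talks to an OpenAI-compatible endpoint, so swapping the backend later is mostly a client-side config change. Below is a minimal sketch using the official openai Python client; the base URL, API key, and model name are placeholders for whatever your tabbyAPI instance is actually configured with.

```python
# Minimal sketch of a client talking to an OpenAI-compatible endpoint
# (tabbyAPI here, but vLLM and TGI expose the same kind of interface).
# The URL, API key, and model name are placeholders for your own deployment.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:5000/v1",  # port and path assumed for a local tabbyAPI instance
    api_key="your-tabby-api-key",
)

response = client.chat.completions.create(
    model="my-exl2-model",  # whatever model the server has loaded
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```

Since vLLM and TGI also offer OpenAI-compatible endpoints, moving off ExLlamaV2 later would mainly mean pointing `base_url` (and the model name) at the new server rather than rewriting the application.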