That 4 bit KV cache looks cool. It would make a great CFG cache optimization. #362
Closed · tau0-deltav started this conversation in Ideas · 1 comment
Update: my idea is a bad idea, because it turns out that if you're willing to run a Q4 CFG cache, there's no reason you'd want the KV cache in higher precision anyway. Honestly, it looks like there's no reason you'd want the KV cache in anything but Q4 in any scenario? Cripes. Q4 isn't even slower than FP8, and FP8 wasn't slow. Aphrodite is now running EXL2 with a Q8 cache that's also stronger than their (real) FP8, but you have to set it up per model. Cool, but not cool enough.
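(For anyone finding this later: switching to the Q4 cache looks like roughly a one-line change in the Python API, if I'm reading exllamav2's examples right. The model path below is a placeholder, not a real model.)

```python
from exllamav2 import (
    ExLlamaV2,
    ExLlamaV2Config,
    ExLlamaV2Cache_Q4,   # drop-in replacement for the FP16 ExLlamaV2Cache
    ExLlamaV2Tokenizer,
)

# "/models/my-model-exl2" is a placeholder path.
config = ExLlamaV2Config("/models/my-model-exl2")
model = ExLlamaV2(config)

# Q4 cache: same interface as the FP16/FP8 caches, just 4-bit storage.
# lazy=True defers allocation so load_autosplit can size it per GPU.
cache = ExLlamaV2Cache_Q4(model, max_seq_len=32768, lazy=True)
model.load_autosplit(cache)

tokenizer = ExLlamaV2Tokenizer(config)
```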
Edit: hang on, is this even exllamav2's job, or is CFG constructed by tabby and such using exllamav2's classes?
So, my question: given that the CFG cache is invisible and its effect so... subjective, could we have options to compress it, and only it? It's not like it would have to work very well to... work very well.
This would enable long-context CFG.
I'm asking about doing this via the quantized cache modes, because it seems like you've gone to a lot of effort making sure the different cache types play nicely together. A rough sketch of what I mean is below.
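Roughly this, continuing from the snippet above. I have no idea whether the internals actually tolerate two caches of different types on one model, so treat it as a sketch; the prompts and `cfg_scale` are made up.

```python
from exllamav2 import ExLlamaV2Cache, ExLlamaV2Cache_Q4

# Main stream keeps the full-precision cache...
cache_pos = ExLlamaV2Cache(model, max_seq_len=32768)
# ...while the CFG (negative-prompt) stream gets Q4 only.
cache_neg = ExLlamaV2Cache_Q4(model, max_seq_len=32768)

ids_pos = tokenizer.encode("the real prompt goes here")     # placeholder text
ids_neg = tokenizer.encode("the negative/CFG prompt here")  # placeholder text

# Two separate forward passes, each filling its own cache;
# take the last-position logits from each stream.
logits_pos = model.forward(ids_pos, cache_pos)[:, -1, :]
logits_neg = model.forward(ids_neg, cache_neg)[:, -1, :]

# Standard CFG mix on the logits (see also the snippet further down).
cfg_scale = 1.5  # made-up value
logits = logits_neg + cfg_scale * (logits_pos - logits_neg)
```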
Also: does anyone have figures or anecdotes about FP8 KV? I've never found anything beyond "negligible penalty." I've not had problems myself, but I worry I'm missing something when I use it (outside of CFG). Aphrodite is doing it too now.
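In lieu of published figures, this is the kind of back-of-envelope test I've used to convince myself (assumes PyTorch >= 2.1 for the float8 dtype; a real cache quantizer adds scaling and grouping, so a bare cast like this likely overstates the error):

```python
import torch

# Round-trip a KV-shaped tensor through FP8 (e4m3) storage and
# measure the relative error. Purely illustrative: exllamav2's
# actual cache quantization is more involved than a bare cast.
kv = torch.randn(1024, 8, 128, dtype=torch.float16)  # seq x heads x head_dim
kv_fp8 = kv.to(torch.float8_e4m3fn).to(torch.float16)
rel_err = (kv - kv_fp8).abs().mean() / kv.abs().mean()
print(f"mean relative error after FP8 round-trip: {rel_err.item():.4f}")
```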
More seriously, is there somewhere I could go to learn about EXL2's design? Sometimes I feel like I missed the preprint.
Ty for reading.
Might be tabbyapi or even tavern, not exl2 itself.
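(If that's right, the guidance step itself is just logit arithmetic, which any layer that sees both sets of logits can do. Textbook CFG, nothing exl2-specific:)

```python
import torch

def cfg_mix(logits_cond: torch.Tensor,
            logits_uncond: torch.Tensor,
            scale: float) -> torch.Tensor:
    """Classifier-free guidance on logits: scale=1.0 reduces to the
    plain conditional logits; scale>1 pushes away from the negative
    prompt's distribution."""
    return logits_uncond + scale * (logits_cond - logits_uncond)
```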