That 4 bit KV cache looks cool. It would make a great CFG cache optimization. #362
Closed · tau0-deltav started this conversation in Ideas · 1 comment
Update: my idea is a bad idea, because it turns out that if you're willing to run a Q4 CFG cache, there's no reason you'd want the KV cache in higher precision anyway. Honestly, it looks like there's no reason you'd want the KV cache in anything but Q4 in any scenario? Cripes. Q4 isn't even slower than FP8, and FP8 wasn't slow. Aphrodite is now running EXL2 with a Q8 cache that's also stronger than their (real) FP8, but you have to set it up per model. Cool, but not cool enough.
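(For anyone finding this later: switching to the Q4 cache looks like roughly a one-line change in the Python API, if I'm reading exllamav2's examples right. The model path below is a placeholder, not a real model.)

```python
from exllamav2 import (
    ExLlamaV2,
    ExLlamaV2Config,
    ExLlamaV2Cache_Q4,   # drop-in replacement for the FP16 ExLlamaV2Cache
    ExLlamaV2Tokenizer,
)

# "/models/my-model-exl2" is a placeholder path.
config = ExLlamaV2Config("/models/my-model-exl2")
model = ExLlamaV2(config)

# Q4 cache: same interface as the FP16/FP8 caches, just 4-bit storage.
# lazy=True defers allocation so load_autosplit can size it per GPU.
cache = ExLlamaV2Cache_Q4(model, max_seq_len=32768, lazy=True)
model.load_autosplit(cache)

tokenizer = ExLlamaV2Tokenizer(config)
```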
Edit: hang on, is this even exllamav2's job, or is CFG constructed by tabby and such using exllamav2's classes?
So, my question: given that the CFG cache is invisible and its effect so... subjective, could we have options to compress it, and only it? It's not like it would have to work very well to... work very well.
This would enable long-context CFG.
I'm asking about doing this via the quantized cache modes, because it seems like you've gone to a lot of effort making sure the different cache types play nicely together. A rough sketch of what I mean is below.
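Roughly this, continuing from the snippet above. I have no idea whether the internals actually tolerate two caches of different types on one model, so treat it as a sketch; the prompts and `cfg_scale` are made up.

```python
from exllamav2 import ExLlamaV2Cache, ExLlamaV2Cache_Q4

# Main stream keeps the full-precision cache...
cache_pos = ExLlamaV2Cache(model, max_seq_len=32768)
# ...while the CFG (negative-prompt) stream gets Q4 only.
cache_neg = ExLlamaV2Cache_Q4(model, max_seq_len=32768)

ids_pos = tokenizer.encode("the real prompt goes here")     # placeholder text
ids_neg = tokenizer.encode("the negative/CFG prompt here")  # placeholder text

# Two separate forward passes, each filling its own cache;
# take the last-position logits from each stream.
logits_pos = model.forward(ids_pos, cache_pos)[:, -1, :]
logits_neg = model.forward(ids_neg, cache_neg)[:, -1, :]

# Standard CFG mix on the logits (see also the snippet further down).
cfg_scale = 1.5  # made-up value
logits = logits_neg + cfg_scale * (logits_pos - logits_neg)
```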
Also: does anyone have figures or anecdotes about FP8 KV? I've never found anything beyond "negligible penalty." I've not had problems myself, but I worry I'm missing something when I use it (outside of CFG). Aphrodite is doing it too now.
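In lieu of published figures, this is the kind of back-of-envelope test I've used to convince myself (assumes PyTorch >= 2.1 for the float8 dtype; a real cache quantizer adds scaling and grouping, so a bare cast like this likely overstates the error):

```python
import torch

# Round-trip a KV-shaped tensor through FP8 (e4m3) storage and
# measure the relative error. Purely illustrative: exllamav2's
# actual cache quantization is more involved than a bare cast.
kv = torch.randn(1024, 8, 128, dtype=torch.float16)  # seq x heads x head_dim
kv_fp8 = kv.to(torch.float8_e4m3fn).to(torch.float16)
rel_err = (kv - kv_fp8).abs().mean() / kv.abs().mean()
print(f"mean relative error after FP8 round-trip: {rel_err.item():.4f}")
```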
More seriously, is there somewhere I could go to learn about EXL2's design? Sometimes I feel like I missed the preprint.
Ty for reading.
Might be tabbyapi or even tavern, not exl2 itself.
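(If that's right, the guidance step itself is just logit arithmetic, which any layer that sees both sets of logits can do. Textbook CFG, nothing exl2-specific:)

```python
import torch

def cfg_mix(logits_cond: torch.Tensor,
            logits_uncond: torch.Tensor,
            scale: float) -> torch.Tensor:
    """Classifier-free guidance on logits: scale=1.0 reduces to the
    plain conditional logits; scale>1 pushes away from the negative
    prompt's distribution."""
    return logits_uncond + scale * (logits_cond - logits_uncond)
```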