Self Speculative Decoding at lower precisions? #10666
Comments
Sorry, one more thing that I forgot to add: is it possible to use your self-speculative decoding method or custom IQ2 quantisation in vLLM in any way? Only typical low-bit quants are mentioned in the docs, and I can't find the source for "ipex_llm.vllm.entrypoints.llm" to check this myself. I also had a thought that might work better than configuring for the various custom quants in llama.cpp such as IQ2: integrating CPU layer offloading directly into the core methods you are using here. It's just a possible alternative idea in case it is any easier. Again, thank you for all the work your team has been doing here!
Hi @ElliottDyson, by the way, speculative decoding support in vLLM is also in progress (https://docs.google.com/document/d/1rE4pr3IdspRw97XbImY4fS9IWYuJJ3HGtL7AdIKGrw8/).
@ElliottDyson Currently we have only optimized IQ2 for memory size, not yet for speed, so using IQ2 as the draft model may not be faster than INT4; using FP8 as the target model should be possible. We do support llama.cpp-compatible IQ2 and IQ1 models through our cpp backend (see https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/llama_cpp_quickstart.html).
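For reference, a minimal sketch of what the FP8-target configuration suggested above might look like, assuming the `ipex_llm.transformers.AutoModelForCausalLM` interface with `load_in_low_bit` and `speculative` keyword arguments (names and defaults may differ between releases; the model path is illustrative). In self-speculative decoding the low-bit draft is derived from the same weights internally, so no separate draft model is passed:

```python
# Sketch only, not an official example: loading a target model in FP8 with
# IPEX-LLM self-speculative decoding enabled. The `load_in_low_bit` and
# `speculative` arguments are assumed from the ipex-llm transformers-style
# API and may vary by version.
import torch
from transformers import AutoTokenizer
from ipex_llm.transformers import AutoModelForCausalLM

model_path = "meta-llama/Llama-2-7b-chat-hf"  # hypothetical example model

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    load_in_low_bit="fp8",      # target model precision
    optimize_model=True,
    speculative=True,           # enable self-speculative decoding
    torch_dtype=torch.float16,
    trust_remote_code=True,
    use_cache=True,
)

inputs = tokenizer("What is speculative decoding?", return_tensors="pt")
with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```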
Just tried this combination. Thank you, FP8 as the target and INT4 as the draft worked very well. Looking forward to the potential of an even speedier lower-precision draft model! 😁
Hello there,
I was wondering whether it is possible to have self-speculative decoding operate with IQ2 as the draft model and FP8 as the core model (FP8 has been shown to very rarely differ in accuracy from FP16).
A look into integrating the following 1.58-bit quant method would also be interesting:
ggerganov/llama.cpp#5999
I was also curious whether llama.cpp quants other than 4-bit are compatible at all, as I noticed you only provide examples using 4-bit quantisations. My reasoning is that being able to offload a number of layers to the GPU while keeping the remaining layers on the CPU is an incredibly useful feature for working with much larger models and/or longer context lengths.
Thanks
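The partial-offload workflow asked about above is what llama.cpp's `n_gpu_layers` (the `-ngl` CLI flag) provides, independent of the quant type. Below is a minimal sketch using the llama-cpp-python bindings; the GGUF file name and layer count are illustrative, not from this thread:

```python
# Sketch, under assumptions: partial GPU offload of a llama.cpp GGUF quant via
# the llama-cpp-python bindings. Any quant type the backend supports (for
# example an IQ2_XS file) can be loaded this way; the path is hypothetical.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-2-7b.IQ2_XS.gguf",  # hypothetical IQ2 quant file
    n_gpu_layers=20,   # offload 20 layers to the GPU, keep the rest on the CPU
    n_ctx=4096,        # longer context fits because fewer layers occupy VRAM
)

out = llm("Q: Why offload only some layers to the GPU? A:", max_tokens=64)
print(out["choices"][0]["text"])
```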