This repository has been archived by the owner on Oct 25, 2024. It is now read-only.
I'm pretty new to the whole LLM thing and am trying to do some comparisons across different processors (e.g. Intel CPU Max, Skylake, AMD EPYC, NVIDIA GPUs). For the CPU part of this I'm focused on Intel Extension for Transformers (and Neural Speed underneath). I'd like to try different quantization levels (INT8, FP16, FP32, etc.) with different kinds of acceleration (AVX2, AVX512, AMX, etc.), but I'm having a hard time getting my head around how to run with a specific quantization and/or acceleration. I'd like to start from the safetensors at https://huggingface.co/bigscience/bloom (because of limitations placed on me).
Is the quantization implicit in the model file I run with? Are there tools to convert from *.safetensors to model files that Intel Extension for Transformers / Neural Speed can use?
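For concreteness, here is roughly what I've been trying so far, adapted from the intel-extension-for-transformers README example. This is just a sketch of what I assume the intended flow is: the plain safetensors come from the Hub, and the quantization is requested at load time via `load_in_4bit` (I'm guessing there are analogous options for INT8 etc., and I've substituted the small bloom-560m checkpoint purely to keep the test quick):

```python
from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

# Small BLOOM variant substituted here only to keep the experiment quick;
# the real target is bigscience/bloom.
model_name = "bigscience/bloom-560m"
prompt = "Once upon a time, there existed a little girl,"

tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)

# My assumption: load_in_4bit asks the extension (Neural Speed underneath)
# to weight-only-quantize the FP safetensors while loading, rather than
# requiring a separately converted model file.
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=100)
```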
Are there arguments to the Intel Extension for Transformers / Neural Speed functions that do the conversion internally? And how can I see what's actually being used inside the extension code? (I've poked around in the source, but there's a lot there and I haven't really gotten my head around it.)
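The only thing I've figured out on my own is checking what the CPU advertises, which tells me which instruction sets are available on each box but not which kernels Neural Speed actually dispatches to:

```python
# Check which SIMD / matrix-extension flags the CPU reports on Linux.
# This only shows what is available, not what the library actually uses.
def cpu_isa_flags(path="/proc/cpuinfo"):
    with open(path) as f:
        for line in f:
            if line.startswith("flags"):
                flags = set(line.split(":", 1)[1].split())
                break
        else:
            return {}
    interesting = ("avx2", "avx512f", "avx512_vnni", "amx_tile", "amx_int8", "amx_bf16")
    return {flag: flag in flags for flag in interesting}

if __name__ == "__main__":
    for flag, present in cpu_isa_flags().items():
        print(f"{flag:12s} {'yes' if present else 'no'}")
```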
I'd appreciate any discussion that helps me get on the right track.
Thanks.