-
Did you get this resolved? I'm running into problems building the sentencepiece pip package on Windows as well.
-
I ran test_inference on a 3090 paired with a Xeon E5 CPU on Windows, and the process appears to be CPU-bottlenecked. I get ~80 t/s with Llama 2 7B EXL2 4bpw and ~20 t/s with 70B EXL2 2.5bpw, which is about 50% and 70% respectively of the speeds listed for a 3090 Ti. During generation, one CPU core is fully loaded at 100% utilization @ 3.8 GHz, and I suspect it is the bottleneck. I read that @turboderp sees only 30% utilization on a 12900K, but a 12900K core shouldn't be more than 2x faster than a 3.8 GHz Xeon core. Is there something wrong with my setup?

Also, with the 70B 2.5bpw model, generation apparently loads the GPU fully at 98% utilization and 400 W, yet inference speed is still only ~70% of what's listed for a 3090 Ti. Is there any chance inference performance can still be improved on lower-clocked Xeon CPUs?
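For reference, this is roughly how I checked the core saturation during a run. It's a minimal sketch: `run_generation` is a stand-in for whatever drives exllamav2's generator in your own benchmark, and the `psutil` sampler thread is my addition, not part of test_inference:

```python
import threading
import time

import psutil
import torch

def watch_cores(stop, samples):
    # Record per-core CPU utilization once per second while generation runs.
    while not stop.is_set():
        samples.append(psutil.cpu_percent(interval=1.0, percpu=True))

def timed_generation(run_generation, num_tokens):
    stop, samples = threading.Event(), []
    watcher = threading.Thread(target=watch_cores, args=(stop, samples))
    watcher.start()

    torch.cuda.synchronize()        # don't let previously queued GPU work skew the timing
    t0 = time.perf_counter()
    run_generation(num_tokens)      # placeholder: your exllamav2 generation loop
    torch.cuda.synchronize()        # wait for all kernels to finish before stopping the clock
    dt = time.perf_counter() - t0

    stop.set()
    watcher.join()

    print(f"{num_tokens / dt:.1f} tokens/s")
    for i, snap in enumerate(samples):
        # One core pinned near 100% while the rest idle suggests a single-threaded bottleneck.
        print(f"t={i}s  max core {max(snap):.0f}%  mean {sum(snap) / len(snap):.0f}%")
```

The per-second snapshots also make it easy to see whether it's always the same single core that's pinned.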
Any help or insight is greatly appreciated.
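For what it's worth, here's the back-of-the-envelope check I did on the GPU side. Assuming every quantized weight has to be read from VRAM once per generated token (ignoring the KV cache and activations) and taking the 3090's spec-sheet bandwidth of ~936 GB/s, my measured speeds imply the memory bus is nowhere near saturated:

```python
PEAK_BW_GB_S = 936  # RTX 3090 spec-sheet memory bandwidth

for name, params, bpw, tok_s in [
    ("llama2-7b  4.0bpw", 7e9, 4.0, 80),
    ("llama2-70b 2.5bpw", 70e9, 2.5, 20),
]:
    weights_gb = params * bpw / 8 / 1e9   # size of the quantized weights
    implied_bw = weights_gb * tok_s       # GB/s if all weights are read once per token
    print(f"{name}: {weights_gb:.1f} GB -> {implied_bw:.0f} GB/s "
          f"({implied_bw / PEAK_BW_GB_S:.0%} of peak)")
```

That comes out to roughly 30% and 47% of peak bandwidth for the 7B and 70B runs, which would suggest the 98% "utilization" figure reflects kernels being busy rather than the memory bus being saturated, and that the GPU still has headroom.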