-
Did you get this resolved? I'm running into problems building the sentencepiece pip package on Windows as well.
-
I ran test_inference on a 3090 paired with a Xeon E5 CPU on Windows, and the process appears to be CPU-bottlenecked. I get ~80 t/s with Llama 2 7B EXL2 4bpw and ~20 t/s with 70B EXL2 2.5bpw, which is about 50% and 70% respectively of the speeds listed for a 3090 Ti. During generation, one CPU core is fully loaded at 100% utilization @ 3.8 GHz, and I suspect it is the bottleneck. I read that @turboderp sees only 30% utilization on a 12900K, but a 12900K core shouldn't be more than 2x faster than a 3.8 GHz Xeon core. Is there something wrong with my setup?

Also, with the 70B 2.5bpw model, generation apparently loads the GPU fully at 98% utilization and 400 W, yet inference speed is still only ~70% of what's listed for a 3090 Ti. Is there any chance inference performance can still be improved on lower-clocked Xeon CPUs?
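For reference, this is roughly how I checked the core saturation during a run. It's a minimal sketch: `run_generation` is a stand-in for whatever drives exllamav2's generator in your own benchmark, and the `psutil` sampler thread is my addition, not part of test_inference:

```python
import threading
import time

import psutil
import torch

def watch_cores(stop, samples):
    # Record per-core CPU utilization once per second while generation runs.
    while not stop.is_set():
        samples.append(psutil.cpu_percent(interval=1.0, percpu=True))

def timed_generation(run_generation, num_tokens):
    stop, samples = threading.Event(), []
    watcher = threading.Thread(target=watch_cores, args=(stop, samples))
    watcher.start()

    torch.cuda.synchronize()        # don't let previously queued GPU work skew the timing
    t0 = time.perf_counter()
    run_generation(num_tokens)      # placeholder: your exllamav2 generation loop
    torch.cuda.synchronize()        # wait for all kernels to finish before stopping the clock
    dt = time.perf_counter() - t0

    stop.set()
    watcher.join()

    print(f"{num_tokens / dt:.1f} tokens/s")
    for i, snap in enumerate(samples):
        # One core pinned near 100% while the rest idle suggests a single-threaded bottleneck.
        print(f"t={i}s  max core {max(snap):.0f}%  mean {sum(snap) / len(snap):.0f}%")
```

The per-second snapshots also make it easy to see whether it's always the same single core that's pinned.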
Any help or insight is greatly appreciated.
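For what it's worth, here's the back-of-the-envelope check I did on the GPU side. Assuming every quantized weight has to be read from VRAM once per generated token (ignoring the KV cache and activations) and taking the 3090's spec-sheet bandwidth of ~936 GB/s, my measured speeds imply the memory bus is nowhere near saturated:

```python
PEAK_BW_GB_S = 936  # RTX 3090 spec-sheet memory bandwidth

for name, params, bpw, tok_s in [
    ("llama2-7b  4.0bpw", 7e9, 4.0, 80),
    ("llama2-70b 2.5bpw", 70e9, 2.5, 20),
]:
    weights_gb = params * bpw / 8 / 1e9   # size of the quantized weights
    implied_bw = weights_gb * tok_s       # GB/s if all weights are read once per token
    print(f"{name}: {weights_gb:.1f} GB -> {implied_bw:.0f} GB/s "
          f"({implied_bw / PEAK_BW_GB_S:.0%} of peak)")
```

That comes out to roughly 30% and 47% of peak bandwidth for the 7B and 70B runs, which would suggest the 98% "utilization" figure reflects kernels being busy rather than the memory bus being saturated, and that the GPU still has headroom.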