Lonnnnnnnnng context load time before generation #34
Comments
What code did you run?
I would like to confirm this issue as well. It really becomes noticeable when you're running chat vs. normal/notebook. Chat with nothing set runs really fast, but once you start adding context etc., startup speed just takes a nosedive. 4-bit 65B on my A6000.
In the case of llama.cpp, when a long prompt is given you can see it output the provided prompt word by word at a slow rate even before it starts generating anything new. It's directly evident that it takes longer to get through larger prompts. I guess a similar thing is happening here.
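If you want to see that scaling directly, here's a minimal timing sketch (my assumption: a Hugging Face transformers causal LM; `facebook/opt-125m` is just a small placeholder, not the 65B models discussed in this thread). It times the prefill forward pass over the whole prompt, which is the work that happens before the first new token appears:

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-125m"  # placeholder model for illustration
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

for n_tokens in (128, 512, 2048):
    # Dummy prompt of n_tokens tokens; the content doesn't matter for timing.
    ids = torch.full((1, n_tokens), tok.eos_token_id, dtype=torch.long)
    start = time.perf_counter()
    with torch.no_grad():
        model(input_ids=ids)  # prefill: one forward pass over the whole prompt
    print(f"{n_tokens:5d} prompt tokens -> {time.perf_counter() - start:.2f} s prefill")
```

The prefill time grows with prompt length even though no new tokens have been generated yet, which matches the delay people are describing here.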
So I compared B&B 8-bit and GPTQ 8-bit, and GPTQ was the only one that had a start delay. Something is causing a delay before anything starts generating.
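A rough way to quantify that start delay, assuming a transformers-style `model` and `tok` are already loaded (the names here are placeholders): generate a single new token, so the measurement is dominated by prompt processing rather than per-token decoding speed.

```python
import time
import torch

def time_to_first_token(model, tok, prompt):
    # Time from sending a prompt until the first new token is produced.
    ids = tok(prompt, return_tensors="pt").input_ids.to(model.device)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    with torch.no_grad():
        model.generate(ids, max_new_tokens=1, do_sample=False)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return time.perf_counter() - start

# Run the same short and long prompts through the B&B and GPTQ loads
# and compare the numbers.
```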
It runs pretty well once it starts; not sure if it's loading something or reading layers before inferencing. It definitely has the quirks of new tech. Might just be a case of "well, that's how it works."
Probably fixed now, see #30. |
I think this issue has been resolved. |
I'm running LLaMA 65B on dual 3090s, and at longer contexts I'm noticing seriously long context load times (the time between sending a prompt and tokens actually being received/streamed). It seems my CPU is only using a single core and maxing it out at 100%... Is there something it's doing that's heavily serialized? ... Any way to parallelize the workflow?
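One thing worth checking (an assumption on my part, not a confirmed fix): how many CPU threads PyTorch is allowed to use for intra-op work. If the serialized step is a plain PyTorch op it may respect this setting; if it happens inside a custom quantization kernel it likely won't.

```python
import torch

# Inspect the current thread limits.
print("intra-op threads:", torch.get_num_threads())
print("inter-op threads:", torch.get_num_interop_threads())

# Let PyTorch use more cores for ops that support intra-op parallelism.
torch.set_num_threads(16)  # placeholder; set to your physical core count
```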