This issue was moved to a discussion.
Why is llama_synchronize called? #6366

Comments
This is something new since pipeline parallelism has been implemented (at least for CUDA) in #6017
They are actually returned after; ensuring that is exactly what llama_synchronize is for (see lines 10030 to 10037 in 0308f5e). When the logits are retrieved (lines 15175 to 15176 in 0308f5e), llama_synchronize is called first, and then the specified logits are extracted (line 15195 in 0308f5e). Any operation that returns the contents of the output buffer calls llama_synchronize first.
Thanks for the detailed explanation, that makes sense. I was wondering: how does the computation graph allow async GPU (CUDA) operations? If you were to build a graph for the Llama architecture, wouldn't all parts need to be executed sequentially? I am sure this assumption is wrong, because llama.cpp would not implement it otherwise.
Async operations are queued into an asynchronous queue (in CUDA this is just a stream) and executed sequentially. The copy doesn't happen until the computation is completed.
Hello all,
I was reading through the codebase and saw llama_synchronize being called when the logits are retrieved. During my work on inference, I noticed that after the model runs, any synchronizing operation blocks for some time before it can complete. After I add an explicit synchronization, it obviously no longer does that. However, this confuses me: why are the logits returned before the GPU is done "working"? What operations cause this? I would appreciate any help!
Edit: When I run a flamegraph, I get this (flamegraph image):
It seems like avoiding the sync would be very beneficial!