
Why is llama_synchronize called? #6366

Closed
EricLBuehler opened this issue Mar 28, 2024 · 3 comments


EricLBuehler commented Mar 28, 2024

Hello all,

I was reading through the codebase and saw that llama_synchronize is called when the logits are retrieved:

GGML_CALL static void ggml_backend_cuda_synchronize(ggml_backend_t backend) {

During my work on inference, I noticed that after the model runs, any synchronizing operation blocks for some time before it can complete. If I add an explicit synchronization right after the model runs, the later operation obviously no longer blocks. However, this confuses me: why are the logits returned before the GPU is done "working"? What operations cause this? I would appreciate any help!

Edit: When I run a flamegraph, I get this:
[flamegraph image: llamacpp]
It seems like avoiding the sync would be very beneficial!

@compilade (Collaborator)

This is something new since pipeline parallelism was implemented (at least for CUDA) in #6017.

why are the logits returned before the GPU is done "working"?

They are actually returned after; that is exactly what llama_synchronize is used for.

In llama_decode, the logits are copied asynchronously [1] to the output buffer, so that llama_decode can return before the computation of the outputs has finished.

llama.cpp/llama.cpp, lines 10030 to 10037 at 0308f5e:

float * logits_out = lctx.logits + n_outputs_prev*n_vocab;
const int32_t n_outputs_new = lctx.n_outputs;
if (n_outputs_new) {
    GGML_ASSERT( n_outputs_prev + n_outputs_new <= n_outputs);
    GGML_ASSERT((n_outputs_prev + n_outputs_new)*n_vocab <= (int64_t) lctx.logits_size);
    ggml_backend_tensor_get_async(backend_res, res, logits_out, 0, n_outputs_new*n_vocab*sizeof(float));
}

When llama_get_logits_ith is called, it first calls llama_synchronize to ensure the data has been copied:

llama.cpp/llama.cpp, lines 15175 to 15176 at 0308f5e:

float * llama_get_logits_ith(struct llama_context * ctx, int32_t i) {
llama_synchronize(ctx);

and then it extracts the specified logits:

return ctx->logits + j*ctx->model.hparams.n_vocab;

What operations cause this?

Any operation that returns the contents of the output buffer calls llama_synchronize before accessing the values (e.g. llama_get_logits, llama_get_logits_ith, llama_get_embeddings, llama_get_embeddings_ith, and llama_get_embeddings_seq).

Note that llama_sampling_sample indirectly calls llama_get_logits_ith, which is what is shown in your flamegraph.
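To make the timing concrete, here is a minimal caller-side sketch (not from this thread). It assumes a prepared llama_context and llama_batch already exist, the two helper functions are hypothetical placeholders, and it only uses the functions discussed above: llama_decode queues the work and returns early, and the first accessor that needs the outputs is where the wait actually happens.

#include "llama.h"

// Hypothetical placeholders for illustration only.
void do_cpu_work_that_does_not_need_logits();
void consume(const float * logits);

void decode_and_overlap(struct llama_context * ctx, struct llama_batch batch) {
    // Queues the graph and the async copy of the logits, then returns
    // before the GPU has necessarily finished.
    if (llama_decode(ctx, batch) != 0) {
        return; // non-zero: error or warning, handle appropriately
    }

    // This CPU work can overlap with the GPU computation.
    do_cpu_work_that_does_not_need_logits();

    // First access to the outputs synchronizes: llama_get_logits_ith
    // calls llama_synchronize(ctx) internally, so this call may block
    // until the computation and the copy have completed.
    consume(llama_get_logits_ith(ctx, 0));
}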

Footnotes

[1] The async copy falls back to a synchronous copy when the backend doesn't support it. See https://github.com/ggerganov/llama.cpp/blob/0308f5e3d7bf9879f818b1a4ae589ff36b242af5/ggml-backend.c#L214-L215
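The fallback referenced in the footnote looks roughly like the following paraphrase. It is written from memory of the linked lines, so the exact names and assertions in ggml-backend.c may differ, and it relies on the internal ggml_backend_i interface exposed by ggml-backend-impl.h.

#include "ggml-backend-impl.h"

// If the backend provides no async getter, fall back to a blocking copy;
// otherwise dispatch to the backend's async implementation.
void tensor_get_async_or_sync(ggml_backend_t backend, const struct ggml_tensor * tensor,
                              void * data, size_t offset, size_t size) {
    if (backend->iface.get_tensor_async == NULL) {
        ggml_backend_tensor_get(tensor, data, offset, size);
    } else {
        backend->iface.get_tensor_async(backend, tensor, data, offset, size);
    }
}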

@EricLBuehler (Author)

Thanks for the detailed explanation; that makes sense. I was wondering: how does the computation graph allow async GPU (CUDA) operations? If you were to build a graph for the Llama architecture, wouldn't all parts need to be executed sequentially? I am sure this is wrong, because llama.cpp would not implement it otherwise.

slaren (Collaborator) commented Mar 29, 2024

Async operations are queued into an asynchronous queue (in CUDA this is just a stream) and executed sequentially. The copy doesn't happen until the computation is completed.
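A minimal stand-alone CUDA sketch of that ordering (not llama.cpp code; the names are made up for illustration): both enqueue calls below return to the host immediately, but because the kernel and the copy are placed on the same stream, the copy cannot start until the kernel has finished.

#include <cuda_runtime.h>

__global__ void compute(float * out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = i * 0.5f;   // stand-in for the real graph computation
}

int main() {
    const int n = 1024;
    float * d_out = nullptr;
    float * h_out = nullptr;
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMallocHost(&h_out, n * sizeof(float)); // pinned, so the copy can be truly async

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Both of these only enqueue work; the host does not wait here.
    compute<<<(n + 255) / 256, 256, 0, stream>>>(d_out, n);
    cudaMemcpyAsync(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost, stream);

    // ... CPU work here overlaps with the GPU ...

    // Only this call blocks, analogous to llama_synchronize.
    cudaStreamSynchronize(stream);

    cudaFreeHost(h_out);
    cudaFree(d_out);
    cudaStreamDestroy(stream);
    return 0;
}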

Repository owner locked and limited conversation to collaborators on Mar 29, 2024.
slaren converted this issue into discussion #6385 on Mar 29, 2024.
