backend : add eval callback #4935

Merged
ggerganov merged 9 commits into master from gg/sched-eval-callback-4931
Jan 17, 2024


@ggerganov (Owner) commented Jan 14, 2024

ref: #4931

# Metal
make -j && ./simple ./models/llama-7b/ggml-model-q4_0.gguf "Hello, my name is" 1

# CUDA
LLAMA_CUBLAS=1 make -j && ./simple ./models/llama-7b/ggml-model-q4_0.gguf "Hello, my name is" 1

The callback currently observes the softmax results in the attention, but can be customized in any way:

// a function that can be called for every computed node during graph evaluation
// the user can choose whether to observe the data of the node depending on the tensor parameters
static bool observe_compute(int node_index, struct ggml_tensor * t, bool ask, void * user_data) {
    GGML_UNUSED(user_data);

    // the scheduler is asking us if we want to observe this node
    if (ask) {
        // check if name contains soft_max
        return strstr(t->name, "soft_max") != nullptr;
    }

    // print the node data
    printf("%s: node_index = %5d, t->name = %32s, t->op = %12s, [%5d, %5d, %5d, %5d]\n",
            __func__, node_index, t->name, ggml_op_name(t->op), (int) t->ne[0], (int) t->ne[1], (int) t->ne[2], (int) t->ne[3]);

    // copy the tensor data from the backend to host memory
    // note: this assumes t holds F32 data, so ggml_nbytes(t) == ggml_nelements(t) * sizeof(float)
    std::vector<float> t_data(ggml_nelements(t));
    ggml_backend_tensor_get(t, t_data.data(), 0, ggml_nbytes(t));

    // print first row
    for (int i = 0; i < t->ne[0]; i++) {
        printf("%8.4f ", t_data[i]);
    }
    printf("\n");

    // returning true lets the evaluation continue
    return true;
}

Omit the last CLI argument (or set it to 0) to disable the callback.
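As a hedged illustration of the "can be customized in any way" remark, the sketch below records the names of matching nodes through `user_data` instead of printing their data. To stay self-contained it uses a hypothetical `mock_tensor` stand-in rather than `struct ggml_tensor`, and it follows the `(t, ask, user_data)` call shape quoted in the review of ggml-backend.c. `collect_names` and `name_log` are illustrative names, not part of the PR.

```cpp
#include <cstring>
#include <string>
#include <vector>

// Hypothetical stand-in for ggml_tensor: only the field this sketch reads.
struct mock_tensor {
    char name[64];
};

// Hypothetical user state handed to the callback through user_data.
struct name_log {
    const char * pattern;            // substring to look for in tensor names
    std::vector<std::string> seen;   // names of the nodes that were observed
};

// ask == true : report whether this node should be observed.
// ask == false: the node has been computed; record its name.
// Returning false from the observe phase requests that evaluation stop.
static bool collect_names(mock_tensor * t, bool ask, void * user_data) {
    auto * log = static_cast<name_log *>(user_data);
    if (ask) {
        return std::strstr(t->name, log->pattern) != nullptr;
    }
    log->seen.emplace_back(t->name);
    return true;
}
```

Keeping the filter in the ask phase cheap matters because it runs for every node, while the observe phase only runs for the nodes that were requested.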

@ggerganov force-pushed the gg/sched-eval-callback-4931 branch from 40cdb39 to 83f3d7a on January 15, 2024 14:24
@ggerganov marked this pull request as ready for review on January 15, 2024 14:31
@ggerganov requested a review from slaren on January 15, 2024 14:31
ggml-backend.c Outdated
Comment on lines 1356 to 1359
if (sched->callback_eval(t, true,  sched->callback_eval_user_data) && // ask
   !sched->callback_eval(t, false, sched->callback_eval_user_data)) { // eval
    break;
}
Collaborator


Is the ask callback really necessary here?

Owner Author


I've changed the implementation to ask only once per node in a split.
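A rough sketch of what "ask only once per node" combined with grouping unneeded nodes into a single compute could look like is given below. It is not the code from ggml-backend.c: `mock_node`, `eval_cb`, `run_split` and the `compute` hook are hypothetical names used only to show the control flow under those assumptions.

```cpp
#include <functional>
#include <vector>

// Hypothetical stand-in for a graph node.
struct mock_node {
    int  id;
    bool wanted; // whether the user wants to observe this node
};

using eval_cb = std::function<bool(const mock_node &, bool ask)>;

// Walk the nodes of one split. Each node is asked about exactly once;
// consecutive nodes the user does not need are computed in a single batch
// that ends at the first needed node (or at the end of the split).
// compute(j0, j1) stands in for evaluating nodes [j0, j1] on the backend.
// Returns the number of compute batches issued.
static int run_split(const std::vector<mock_node> & nodes,
                     const eval_cb & cb,
                     const std::function<void(int, int)> & compute) {
    int n_batches = 0;
    const int n = (int) nodes.size();
    for (int j0 = 0; j0 < n; j0++) {
        int  j1   = j0;
        bool need = cb(nodes[j1], true);       // ask
        while (!need && j1 < n - 1) {
            need = cb(nodes[++j1], true);      // ask the next node, once
        }
        compute(j0, j1);
        n_batches++;
        if (need && !cb(nodes[j1], false)) {   // observe
            break;                             // the user asked to stop
        }
        j0 = j1;
    }
    return n_batches;
}
```

With five nodes where only the third is wanted, this sketch issues two batches (nodes 0 to 2, then 3 to 4), asks five times and observes once, so a user who observes nothing pays for a single batch per split.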

ggml-backend.c Outdated
Comment on lines 1387 to 1390
// TODO: should we clear the callbacks?
//sched->callback_eval = NULL;
//sched->callback_eval_user_data = NULL;

Collaborator


I think this is fine; we don't need to clear the callbacks here. The reset function is meant to prepare the sched for the next graph evaluation by resetting the allocators and the backend assignments (similar to ggml_allocr_reset).
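The distinction drawn here, per-graph state versus user configuration, can be sketched roughly as follows. `mock_sched` and `mock_sched_reset` are hypothetical and heavily simplified; only the two `callback_eval*` field names are taken from the snippet under review.

```cpp
#include <vector>

// Callback shape modelled on the reviewed call sites, with void * in place of a tensor.
using mock_eval_cb = bool (*)(void * node, bool ask, void * user_data);

// Hypothetical, heavily simplified scheduler state.
struct mock_sched {
    std::vector<int> node_backend_ids;               // per-graph backend assignments
    mock_eval_cb     callback_eval           = nullptr;
    void *           callback_eval_user_data = nullptr;
};

// Prepare for the next graph evaluation: drop per-graph state only.
static void mock_sched_reset(mock_sched & sched) {
    sched.node_backend_ids.clear();
    // callback_eval and callback_eval_user_data are intentionally left untouched,
    // so a callback installed once keeps firing on later evaluations
}
```

Under this reading, a caller that wants to stop observing would unset the callback explicitly rather than rely on reset to do it.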

@ggerganov added the "sync" label (Requires sync with the ggml repo after merging) on Jan 17, 2024
@ggerganov merged commit 44a1a4a into master on Jan 17, 2024
38 of 47 checks passed
@ggerganov deleted the gg/sched-eval-callback-4931 branch on January 17, 2024 16:39
brittlewis12 added a commit to brittlewis12/llama.cpp that referenced this pull request Jan 18, 2024
jordankanter pushed a commit to jordankanter/llama.cpp that referenced this pull request Feb 3, 2024
* backend : add eval callback

ggml-ci

* backend : group nodes in a single compute when user don't need them

* backend : clean-up the implementation

ggml-ci

* simple : do not perform tensor data copy if not needed

* simple : fix

* simple : no need for ggml_is_contiguous + fix bool parse

* llama : fix callback placement in llama_context_params

* backend : avoid double-ask callback calls

* simple : restore examples, imatrix will serve as a demo
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024
2 participants