Add importance matrix calculation to non-CPU back-ends #4931

ikawrakow · 2024-01-14T12:00:54Z

The imatrix tool, which computes an "importance matrix" that can be used to improve quantization accuracy, currently only works when run on the CPU, which is quite slow. In addition, when llama.cpp is built with CUDA support enabled, the call to the data collection function is bypassed, and one gets an empty result, which is inconvenient and leads to confusion.

Also, given the discussions around PRs #4897, #4861, #4856, #4773, where importance matrix capabilities were added to llama.cpp, there appears to be a lot of interest in experimenting with different training dataset to create the importance matrix. But experimentation is difficult with the much lower CPU performance compared to the GPU.

So, overall, it would be very useful to support importance matrix calculations on faster back-ends (CUDA, Metal, etc.).

The text was updated successfully, but these errors were encountered:

ggerganov · 2024-01-14T12:20:29Z

From an API standpoint, we should be able to pass the callback through the llama.h API. When there is a callback provided, we would then compute the graph node-by-node and call it for each result. And if possible, we would probably want to filter which nodes are passed to the callback - either by op type or name, so that we avoid extra data copying for nodes which are filtered.

cc @slaren for insights

slaren · 2024-01-14T13:07:02Z

I think this is a good solution, the only change I would make to this is having the user receive a ggml_tensor* that they can choose to read or ignore, then all the filtering would be done on the application side. To recap:

Add a function to ggml_backen_sched to set a callback
If they callback is set, nodes are executed one at a time and the callback is called with the node after each op
For CPU, this will have the overhead of launching the threads for every op
With CUDA, evaluation is asynchronous and a synchronization will only happen if the user calls ggml_backend_tensor_get to read the result, so the performance should be good for cases where the user is only interested in a few tensors and ignores the rest
We can expose this functionality in the llama.cpp API by allowing the user to set the callback. The callback would receive a ggml_tensor*, all the filtering would be up to the application. I think it is ok to expose ggml types in the llama.cpp API for advanced use cases such as this.

ggerganov · 2024-01-14T13:19:10Z

the only change I would make to this is having the user receive a ggml_tensor* that they can choose to read or ignore, then all the filtering would be done on the application side.

I'm worried that we might end up moving a lot of data back and forth when using CUDA (Metal is not a problem due to unified memory) and hinder the performance. But I agree it would be much cleaner, so maybe as a first iteration we can do it like this and then look for improvements.

slaren · 2024-01-14T13:21:57Z

Performance with CUDA will be good, the overhead will actually be lower than with Metal or even CPU.

With CUDA, evaluation is asynchronous and a synchronization will only happen if the user calls ggml_backend_tensor_get to read the result, so the performance should be good for cases where the user is only interested in a few tensors and ignores the rest

ggerganov · 2024-01-14T13:24:46Z

Thanks, I was too quick to respond and missed your point. Sounds great

Edit: let me give this a try

ggerganov · 2024-01-14T14:56:42Z

The PoC is here #4935 - seems to work great and pretty easy to add. As expected, Metal slows down quite a lot due to having to start and stop the computation for each node. However, for CUDA I don't observe any significant slowdowns

ggerganov · 2024-01-14T15:32:57Z

I updated the callback to "ask" the user if they are interested in the data of a particular node. This way, the scheduler can now group nodes that the user does not want to observe into a single compute call. This fixes the performance with Metal

ikawrakow added the enhancement New feature or request label Jan 14, 2024

ggerganov self-assigned this Jan 14, 2024

ggerganov mentioned this issue Jan 14, 2024

backend : add eval callback #4935

Merged

ggerganov mentioned this issue Jan 15, 2024

imatrix : offload to GPU support #4957

Merged

ggerganov closed this as completed in #4957 Jan 17, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add importance matrix calculation to non-CPU back-ends #4931

Add importance matrix calculation to non-CPU back-ends #4931

ikawrakow commented Jan 14, 2024

ggerganov commented Jan 14, 2024

slaren commented Jan 14, 2024

ggerganov commented Jan 14, 2024

slaren commented Jan 14, 2024

ggerganov commented Jan 14, 2024 •

edited

Loading

ggerganov commented Jan 14, 2024

ggerganov commented Jan 14, 2024

Add importance matrix calculation to non-CPU back-ends #4931

Add importance matrix calculation to non-CPU back-ends #4931

Comments

ikawrakow commented Jan 14, 2024

ggerganov commented Jan 14, 2024

slaren commented Jan 14, 2024

ggerganov commented Jan 14, 2024

slaren commented Jan 14, 2024

ggerganov commented Jan 14, 2024 • edited Loading

ggerganov commented Jan 14, 2024

ggerganov commented Jan 14, 2024

ggerganov commented Jan 14, 2024 •

edited

Loading