Add importance matrix calculation to non-CPU back-ends #4931
Comments
From an API standpoint, we should be able to pass the callback through. cc @slaren for insights

I think this is a good solution; the only change I would make is having the user receive a `ggml_tensor` for each node.
I'm worried that we might end up moving a lot of data back and forth when using CUDA (Metal is not a problem due to unified memory) and hurting performance. But I agree it would be much cleaner, so maybe as a first iteration we can do it like this and then look for improvements.
Performance with CUDA will be good; the overhead will actually be lower than with Metal or even the CPU.
Thanks, I was too quick to respond and missed your point. Sounds great. Edit: let me give this a try.
The PoC is here: #4935 - it seems to work great and was pretty easy to add. As expected, Metal slows down quite a lot due to having to start and stop the computation for each node. However, for CUDA I don't observe any significant slowdown.
I updated the callback to "ask" the user whether they are interested in the data of a particular node. This way, the scheduler can now group nodes that the user does not want to observe into a single compute call. This fixes the performance with Metal.
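
A minimal sketch of what such an "ask"-style callback could look like, following the two-phase protocol described above. The type and setter names (`ggml_backend_sched_eval_callback`, `ggml_backend_sched_set_eval_callback`) match the ggml-backend API, but treat the details here as illustrative rather than as the final implementation:

```c
#include "ggml.h"
#include "ggml-backend.h"

#include <stdio.h>

// Two-phase observation callback:
//  - ask == true:  the scheduler asks whether this node should be observed;
//                  returning false lets it group unobserved nodes into a
//                  single compute call (this is what fixes Metal performance)
//  - ask == false: the node has been computed and its data can be read;
//                  returning false aborts the rest of the computation
static bool observe_cb(struct ggml_tensor * t, bool ask, void * user_data) {
    (void) user_data;

    if (ask) {
        // only observe matrix multiplications, as the imatrix tool does
        return t->op == GGML_OP_MUL_MAT;
    }

    printf("computed %s (%s)\n", t->name, ggml_op_name(t->op));
    return true;
}

// registration with the backend scheduler:
//   ggml_backend_sched_set_eval_callback(sched, observe_cb, NULL);
```
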
The `imatrix` tool, which computes an "importance matrix" that can be used to improve quantization accuracy, currently only works when run on the CPU, which is quite slow. In addition, when `llama.cpp` is built with CUDA support enabled, the call to the data collection function is bypassed and one gets an empty result, which is inconvenient and leads to confusion.

Also, given the discussions around PRs #4897, #4861, #4856, and #4773, where importance matrix capabilities were added to `llama.cpp`, there appears to be a lot of interest in experimenting with different training datasets to create the importance matrix. But experimentation is difficult with the much lower CPU performance compared to the GPU.

So, overall, it would be very useful to support importance matrix calculations on faster back-ends (CUDA, Metal, etc.).
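
For context, a rough sketch of the kind of per-node data collection `imatrix` performs, hooked into the `ask == false` branch of a callback like the one above. The struct and helper names are hypothetical, and the code assumes contiguous F32 activations for simplicity:

```c
#include "ggml.h"
#include "ggml-backend.h"

#include <stdlib.h>

// Hypothetical accumulator: a running sum of squared activations per column
// of one weight matrix (the real tool keeps one of these per tensor name).
struct imatrix_stats {
    int     ncols; // number of columns of the weight matrix (src1->ne[0])
    float * sums;  // sums[ncols]: accumulated x[c]^2 over all rows seen
    int     nrows; // number of activation rows accumulated so far
};

// Called for GGML_OP_MUL_MAT nodes once they have been computed. src1 holds
// the activations; on CUDA this copy is the device-to-host traffic discussed
// above, which turned out to have acceptably low overhead.
static void collect_imatrix(struct ggml_tensor * t, struct imatrix_stats * st) {
    const struct ggml_tensor * src1 = t->src[1];

    const int ncols = (int) src1->ne[0];
    const int nrows = (int) ggml_nrows(src1);

    // copy the activations from the backend (device) buffer to host memory
    float * data = malloc(ggml_nbytes(src1));
    ggml_backend_tensor_get(src1, data, 0, ggml_nbytes(src1));

    // accumulate the sum of squared activations for each input column
    for (int r = 0; r < nrows; ++r) {
        const float * x = data + (size_t) r * ncols;
        for (int c = 0; c < ncols; ++c) {
            st->sums[c] += x[c] * x[c];
        }
    }
    st->nrows += nrows;

    free(data);
}
```
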