Another bucket sort #5109

Merged 3 commits into master from ik/bucket_sort on Jan 26, 2024
Conversation

ikawrakow (Contributor)

We now have three PRs aimed at sorting logits to speed up top-k sampling:

The table shows a comparison between master and these three PRs as a function of top_k. Tests were run on a Ryzen 5975WX + RTX 4080 with Ubuntu 22.04 and GCC 11.4.0.

| --top-k | t/s master | t/s PR 5085 | t/s PR 5101 | t/s this PR | Speedup this PR |
|--------:|-----------:|------------:|------------:|------------:|----------------:|
| 100     | 7758       | 7758        | 3084        | 7758        | 1.000 |
| 200     | 6796       | 6796        | 3076        | 7121        | 1.048 |
| 500     | 4848       | 4848        | 2905        | 6733        | 1.389 |
| 1000    | 3353       | 3353        | 2872        | 6075        | 1.812 |
| 2000    | 2123       | 2123        | 2592        | 5006        | 2.358 |
| 4000    | 1288       | 2000        | 2246        | 3627        | 2.816 |
| 8000    | 783        | 1438        | 1718        | 2292        | 2.927 |
| 16000   | 508        | 921         | 1143        | 1326        | 2.611 |
| 31780   | 397        | 558         | 675         | 734         | 1.848 |
| 32000   | 559        | 559         | 703         | 773         | 1.384 |
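For readers unfamiliar with the technique, here is a minimal, self-contained sketch of the bucket idea behind this family of PRs. It is not the PR's actual code, and `top_k_bucket` and its parameters are hypothetical names: distribute the logits into value-range buckets, then sort only the buckets needed to cover the top k, instead of sorting all n logits.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical sketch of bucket-based top-k selection (not the PR's code):
// scatter logits into buckets by value, walk buckets from the highest value
// down until at least k candidates are collected, and sort only those.
std::vector<float> top_k_bucket(const std::vector<float>& logits, int k,
                                int nbuckets = 128) {
    float lo = logits[0], hi = logits[0];
    for (float x : logits) { lo = std::min(lo, x); hi = std::max(hi, x); }
    if (hi == lo) hi = lo + 1.0f; // avoid division by zero for constant input
    const float scale = nbuckets / (hi - lo);

    // Scatter into buckets; bucket index grows with the logit value.
    std::vector<std::vector<float>> buckets(nbuckets);
    for (float x : logits) {
        int b = std::min(nbuckets - 1, int((x - lo) * scale));
        buckets[b].push_back(x);
    }

    // Collect buckets from the top until we have at least k candidates.
    std::vector<float> cand;
    for (int b = nbuckets - 1; b >= 0 && (int)cand.size() < k; --b) {
        cand.insert(cand.end(), buckets[b].begin(), buckets[b].end());
    }

    // Only the (usually small) candidate set needs sorting.
    std::partial_sort(cand.begin(), cand.begin() + k, cand.end(),
                      std::greater<float>());
    cand.resize(k);
    return cand;
}
```

The win comes from the final sort touching only the candidates gathered from the top buckets rather than the full vocabulary, which matters most at large top-k values, as the table above shows.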

@cmp-nct (Contributor)

cmp-nct commented Jan 24, 2024

If this continues we'll soon see a GPU tensor-core sorting kernel beating this one again :)
It's quite amazing to see how much performance can be gained.
On the other hand, I think pre-filtering K to a more meaningful value is still the best way to go in terms of practicality, if huge top-k values are the future.

@ikawrakow (Contributor, Author)

ikawrakow commented Jan 24, 2024

> If this continues we'll soon see a GPU tensor-core sorting kernel beating this one again :) It's quite amazing to see how much performance can be gained. On the other hand, I think pre-filtering K to a more meaningful value is still the best way to go in terms of practicality, if huge top-k values are the future.

Well, when the GPU beats this, there is still room for improvement. One can easily shave off another 10% or so from the time by having a top_k sampler instance with pre-allocated buffers, so that no memory allocations are needed on each invocation of llama_sample_top_k. One can vectorize at least the first loop. Etc.

More seriously, I do agree with you that if the usage of large top_k becomes standard practice, it is better to prefilter the logits in some way.
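The pre-allocation idea mentioned above can be sketched as follows. This is an illustrative pattern, not llama.cpp code, and `TopKSampler` is a hypothetical name: the sampler object owns its scratch buffer, so repeated calls reuse the allocated capacity instead of allocating on every invocation.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical sketch (not llama.cpp's API): a sampler that keeps its scratch
// buffer between calls, avoiding a heap allocation per invocation.
struct TopKSampler {
    std::vector<float> scratch; // reused across invocations

    std::vector<float> top_k(const std::vector<float>& logits, size_t k) {
        // assign() reuses existing capacity when it is large enough,
        // so steady-state calls perform no allocation for the copy.
        scratch.assign(logits.begin(), logits.end());
        std::partial_sort(scratch.begin(), scratch.begin() + k, scratch.end(),
                          std::greater<float>());
        return std::vector<float>(scratch.begin(), scratch.begin() + k);
    }
};
```

After the first call warms the buffer, subsequent calls with vocabularies of the same size reuse the same storage, which is where the "another 10% or so" estimate above would come from.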

@JohannesGaessler (Collaborator)

PR for a min_p implementation that works on unsorted tokens: #5115.
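For context, min_p filtering does not actually require sorted tokens at all. The following is a hedged sketch of the idea (not the code in #5115; `min_p_filter` is a hypothetical name): a token survives if its probability is at least min_p times the maximum probability, which in logit space means its logit is at least max_logit + log(min_p). That is one max-scan plus one filter pass, with no sort.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical sketch of sort-free min-p filtering (not PR #5115's code).
// prob >= min_p * max_prob  <=>  logit >= max_logit + log(min_p),
// because softmax preserves the ordering of logits.
std::vector<float> min_p_filter(const std::vector<float>& logits, float min_p) {
    const float max_logit = *std::max_element(logits.begin(), logits.end());
    const float threshold = max_logit + std::log(min_p);
    std::vector<float> kept;
    for (float x : logits) {
        if (x >= threshold) kept.push_back(x); // single pass, no sorting
    }
    return kept;
}
```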

@JohannesGaessler (Collaborator) left a review:

LGTM; the code has become more difficult to understand but in my opinion speed is more important unless the code is irrelevant for performance. But maybe we should wait for the opinion of another dev just to be sure.

I can confirm that the performance is better than both master and my bucket sort PR:

(benchmark plot: bucket_sort)

Good job!

@ikawrakow ikawrakow merged commit 1182cf4 into master Jan 26, 2024
48 checks passed
@ikawrakow ikawrakow deleted the ik/bucket_sort branch January 26, 2024 07:14
jordankanter pushed a commit to jordankanter/llama.cpp that referenced this pull request Feb 3, 2024
* Initial bucket sort

* Bucket sort: slightly better version

* Bucket sort: another minor improvement

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024