Another bucket sort #5109

Merged 3 commits into master from ik/bucket_sort on Jan 26, 2024
Conversation

ikawrakow (Contributor)

We now have three PRs aimed at sorting logits to speed up top-k sampling:

The table shows a comparison between master and these three PRs as a function of top_k. Tests were run on a Ryzen 5975WX + RTX 4080 with Ubuntu 22.04 and GCC 11.4.0.

| --top-k | t/s master | t/s PR 5085 | t/s PR 5101 | t/s this PR | Speedup this PR |
|--------:|-----------:|------------:|------------:|------------:|----------------:|
| 100     | 7758       | 7758        | 3084        | 7758        | 1.000 |
| 200     | 6796       | 6796        | 3076        | 7121        | 1.048 |
| 500     | 4848       | 4848        | 2905        | 6733        | 1.389 |
| 1000    | 3353       | 3353        | 2872        | 6075        | 1.812 |
| 2000    | 2123       | 2123        | 2592        | 5006        | 2.358 |
| 4000    | 1288       | 2000        | 2246        | 3627        | 2.816 |
| 8000    | 783        | 1438        | 1718        | 2292        | 2.927 |
| 16000   | 508        | 921         | 1143        | 1326        | 2.611 |
| 31780   | 397        | 558         | 675         | 734         | 1.848 |
| 32000   | 559        | 559         | 703         | 773         | 1.384 |
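For readers unfamiliar with the technique, here is a minimal, self-contained sketch of the bucket idea behind this family of PRs. It is not the PR's actual code, and `top_k_bucket` and its parameters are hypothetical names: distribute the logits into value-range buckets, then sort only the buckets needed to cover the top k, instead of sorting all n logits.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical sketch of bucket-based top-k selection (not the PR's code):
// scatter logits into buckets by value, walk buckets from the highest value
// down until at least k candidates are collected, and sort only those.
std::vector<float> top_k_bucket(const std::vector<float>& logits, int k,
                                int nbuckets = 128) {
    float lo = logits[0], hi = logits[0];
    for (float x : logits) { lo = std::min(lo, x); hi = std::max(hi, x); }
    if (hi == lo) hi = lo + 1.0f; // avoid division by zero for constant input
    const float scale = nbuckets / (hi - lo);

    // Scatter into buckets; bucket index grows with the logit value.
    std::vector<std::vector<float>> buckets(nbuckets);
    for (float x : logits) {
        int b = std::min(nbuckets - 1, int((x - lo) * scale));
        buckets[b].push_back(x);
    }

    // Collect buckets from the top until we have at least k candidates.
    std::vector<float> cand;
    for (int b = nbuckets - 1; b >= 0 && (int)cand.size() < k; --b) {
        cand.insert(cand.end(), buckets[b].begin(), buckets[b].end());
    }

    // Only the (usually small) candidate set needs sorting.
    std::partial_sort(cand.begin(), cand.begin() + k, cand.end(),
                      std::greater<float>());
    cand.resize(k);
    return cand;
}
```

The win comes from the final sort touching only the candidates gathered from the top buckets rather than the full vocabulary, which matters most at large top-k values, as the table above shows.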

@cmp-nct (Contributor)

cmp-nct commented Jan 24, 2024

If this continues we'll soon see a GPU tensor-core sorting kernel beating this one again :)
It's quite amazing to see how much performance can be gained.
On the other hand, I think pre-filtering K to a more meaningful value is still the best way to go in terms of practicality, if huge top-k values are the future.

@ikawrakow (Contributor, Author)

ikawrakow commented Jan 24, 2024

> If this continues we'll soon see a GPU tensor-core sorting kernel beating this one again :) It's quite amazing to see how much performance can be gained. On the other hand, I think pre-filtering K to a more meaningful value is still the best way to go in terms of practicality, if huge top-k values are the future.

Well, when the GPU beats this, there is still room for improvement. One can easily shave off another 10% or so from the time by having a top_k sampler instance with pre-allocated buffers, so that no memory allocations are needed on each invocation of llama_sample_top_k. One can vectorize at least the first loop. Etc.

More seriously, I do agree with you that if the usage of large top_k becomes standard practice, it is better to prefilter the logits in some way.
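The pre-allocation idea mentioned above can be sketched as follows. This is an illustrative pattern, not llama.cpp code, and `TopKSampler` is a hypothetical name: the sampler object owns its scratch buffer, so repeated calls reuse the allocated capacity instead of allocating on every invocation.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical sketch (not llama.cpp's API): a sampler that keeps its scratch
// buffer between calls, avoiding a heap allocation per invocation.
struct TopKSampler {
    std::vector<float> scratch; // reused across invocations

    std::vector<float> top_k(const std::vector<float>& logits, size_t k) {
        // assign() reuses existing capacity when it is large enough,
        // so steady-state calls perform no allocation for the copy.
        scratch.assign(logits.begin(), logits.end());
        std::partial_sort(scratch.begin(), scratch.begin() + k, scratch.end(),
                          std::greater<float>());
        return std::vector<float>(scratch.begin(), scratch.begin() + k);
    }
};
```

After the first call warms the buffer, subsequent calls with vocabularies of the same size reuse the same storage, which is where the "another 10% or so" estimate above would come from.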

@JohannesGaessler (Collaborator)

PR for a min_p implementation that works on unsorted tokens: #5115.
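For context, min_p filtering does not actually require sorted tokens at all. The following is a hedged sketch of the idea (not the code in #5115; `min_p_filter` is a hypothetical name): a token survives if its probability is at least min_p times the maximum probability, which in logit space means its logit is at least max_logit + log(min_p). That is one max-scan plus one filter pass, with no sort.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical sketch of sort-free min-p filtering (not PR #5115's code).
// prob >= min_p * max_prob  <=>  logit >= max_logit + log(min_p),
// because softmax preserves the ordering of logits.
std::vector<float> min_p_filter(const std::vector<float>& logits, float min_p) {
    const float max_logit = *std::max_element(logits.begin(), logits.end());
    const float threshold = max_logit + std::log(min_p);
    std::vector<float> kept;
    for (float x : logits) {
        if (x >= threshold) kept.push_back(x); // single pass, no sorting
    }
    return kept;
}
```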

@JohannesGaessler (Collaborator) left a review:

LGTM; the code has become more difficult to understand but in my opinion speed is more important unless the code is irrelevant for performance. But maybe we should wait for the opinion of another dev just to be sure.

I can confirm that the performance is better than both master and my bucket sort PR:

(benchmark plot: bucket_sort)

Good job!

@ikawrakow ikawrakow merged commit 1182cf4 into master Jan 26, 2024
48 checks passed
@ikawrakow ikawrakow deleted the ik/bucket_sort branch January 26, 2024 07:14
jordankanter pushed a commit to jordankanter/llama.cpp that referenced this pull request Feb 3, 2024
* Initial bucket sort

* Bucket sort: slightly better version

* Bucket sort: another minor improvement

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024