Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

v1.18: implements weighted shuffle using binary tree (backport of #185) #425

Merged
merged 1 commit into from
Mar 26, 2024

Conversation

mergify[bot]
Copy link

@mergify mergify bot commented Mar 26, 2024

This is partial port of firedancer's implementation of weighted shuffle:
https://github.com/firedancer-io/firedancer/blob/3401bfc26/src/ballet/wsample/fd_wsample.c

Problem

Though Fenwick trees use less space, inverse queries require an
additional O(log n) factor for binary search resulting an overall
O(n log n log n) performance for weighted shuffle.

Summary of Changes

This commit instead uses a binary tree where each node contains the sum
of all weights in its left sub-tree. The weights themselves are
implicitly stored at the leaves. Inverse queries and updates to the tree
all can be done O(log n) resulting an overall O(n log n) weighted
shuffle implementation.

Based on benchmarks, this results in 24% improvement in
WeightedShuffle::shuffle:

Fenwick tree:

test bench_weighted_shuffle_new     ... bench:      36,686 ns/iter (+/- 191)
test bench_weighted_shuffle_shuffle ... bench:     342,625 ns/iter (+/- 4,067)

Binary tree:

test bench_weighted_shuffle_new     ... bench:      59,131 ns/iter (+/- 362)
test bench_weighted_shuffle_shuffle ... bench:     260,194 ns/iter (+/- 11,195)

Though WeightedShuffle::new is now slower, it generally can be cached
and reused as in Turbine:
https://github.com/anza-xyz/agave/blob/b3fd87fe8/turbine/src/cluster_nodes.rs#L68

Additionally the new code has better asymptotic performance. For
example with 20_000 weights WeightedShuffle::shuffle is 31% faster:

Fenwick tree:

test bench_weighted_shuffle_new     ... bench:     255,071 ns/iter (+/- 9,591)
test bench_weighted_shuffle_shuffle ... bench:   2,466,058 ns/iter (+/- 9,873)

Binary tree:

test bench_weighted_shuffle_new     ... bench:     830,727 ns/iter (+/- 10,210)
test bench_weighted_shuffle_shuffle ... bench:   1,696,160 ns/iter (+/- 75,271)<hr>This is an automatic backport of pull request #185 done by [Mergify](https://mergify.com).

This is partial port of firedancer's implementation of weighted shuffle:
https://github.com/firedancer-io/firedancer/blob/3401bfc26/src/ballet/wsample/fd_wsample.c

Though Fenwick trees use less space, inverse queries require an
additional O(log n) factor for binary search resulting an overall
O(n log n log n) performance for weighted shuffle.

This commit instead uses a binary tree where each node contains the sum
of all weights in its left sub-tree. The weights themselves are
implicitly stored at the leaves. Inverse queries and updates to the tree
all can be done O(log n) resulting an overall O(n log n) weighted
shuffle implementation.

Based on benchmarks, this results in 24% improvement in
WeightedShuffle::shuffle:

Fenwick tree:
    test bench_weighted_shuffle_new     ... bench:      36,686 ns/iter (+/- 191)
    test bench_weighted_shuffle_shuffle ... bench:     342,625 ns/iter (+/- 4,067)

Binary tree:
    test bench_weighted_shuffle_new     ... bench:      59,131 ns/iter (+/- 362)
    test bench_weighted_shuffle_shuffle ... bench:     260,194 ns/iter (+/- 11,195)

Though WeightedShuffle::new is now slower, it generally can be cached
and reused as in Turbine:
https://github.com/anza-xyz/agave/blob/b3fd87fe8/turbine/src/cluster_nodes.rs#L68

Additionally the new code has better asymptotic performance. For
example with 20_000 weights WeightedShuffle::shuffle is 31% faster:

Fenwick tree:
    test bench_weighted_shuffle_new     ... bench:     255,071 ns/iter (+/- 9,591)
    test bench_weighted_shuffle_shuffle ... bench:   2,466,058 ns/iter (+/- 9,873)

Binary tree:
    test bench_weighted_shuffle_new     ... bench:     830,727 ns/iter (+/- 10,210)
    test bench_weighted_shuffle_shuffle ... bench:   1,696,160 ns/iter (+/- 75,271)

(cherry picked from commit b6d2237)
@codecov-commenter
Copy link

Codecov Report

Attention: Patch coverage is 99.00000% with 1 lines in your changes are missing coverage. Please review.

Project coverage is 81.6%. Comparing base (cc4f37e) to head (204b10b).

Additional details and impacted files
@@           Coverage Diff           @@
##            v1.18     #425   +/-   ##
=======================================
  Coverage    81.5%    81.6%           
=======================================
  Files         827      827           
  Lines      224706   224756   +50     
=======================================
+ Hits       183329   183401   +72     
+ Misses      41377    41355   -22     

@behzadnouri behzadnouri added the automerge automerge Merge this Pull Request automatically once CI passes label Mar 26, 2024
@mergify mergify bot merged commit e356a44 into v1.18 Mar 26, 2024
36 checks passed
@mergify mergify bot deleted the mergify/bp/v1.18/pr-185 branch March 26, 2024 07:10
anwayde pushed a commit to firedancer-io/agave that referenced this pull request Jul 23, 2024
…a-xyz#185) (anza-xyz#425)

implements weighted shuffle using binary tree (anza-xyz#185)

This is partial port of firedancer's implementation of weighted shuffle:
https://github.com/firedancer-io/firedancer/blob/3401bfc26/src/ballet/wsample/fd_wsample.c

Though Fenwick trees use less space, inverse queries require an
additional O(log n) factor for binary search resulting an overall
O(n log n log n) performance for weighted shuffle.

This commit instead uses a binary tree where each node contains the sum
of all weights in its left sub-tree. The weights themselves are
implicitly stored at the leaves. Inverse queries and updates to the tree
all can be done O(log n) resulting an overall O(n log n) weighted
shuffle implementation.

Based on benchmarks, this results in 24% improvement in
WeightedShuffle::shuffle:

Fenwick tree:
    test bench_weighted_shuffle_new     ... bench:      36,686 ns/iter (+/- 191)
    test bench_weighted_shuffle_shuffle ... bench:     342,625 ns/iter (+/- 4,067)

Binary tree:
    test bench_weighted_shuffle_new     ... bench:      59,131 ns/iter (+/- 362)
    test bench_weighted_shuffle_shuffle ... bench:     260,194 ns/iter (+/- 11,195)

Though WeightedShuffle::new is now slower, it generally can be cached
and reused as in Turbine:
https://github.com/anza-xyz/agave/blob/b3fd87fe8/turbine/src/cluster_nodes.rs#L68

Additionally the new code has better asymptotic performance. For
example with 20_000 weights WeightedShuffle::shuffle is 31% faster:

Fenwick tree:
    test bench_weighted_shuffle_new     ... bench:     255,071 ns/iter (+/- 9,591)
    test bench_weighted_shuffle_shuffle ... bench:   2,466,058 ns/iter (+/- 9,873)

Binary tree:
    test bench_weighted_shuffle_new     ... bench:     830,727 ns/iter (+/- 10,210)
    test bench_weighted_shuffle_shuffle ... bench:   1,696,160 ns/iter (+/- 75,271)

(cherry picked from commit b6d2237)

Co-authored-by: behzad nouri <behzadnouri@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
automerge automerge Merge this Pull Request automatically once CI passes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants