v1.18: implements weighted shuffle using binary tree (backport of #185) #425

mergify · 2024-03-26T05:33:22Z

This is a partial port of firedancer's implementation of weighted shuffle:
https://github.com/firedancer-io/firedancer/blob/3401bfc26/src/ballet/wsample/fd_wsample.c

Problem

Though Fenwick trees use less space, inverse queries require an
additional O(log n) factor for binary search, resulting in an overall
O(n log n log n) performance for weighted shuffle.
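
To illustrate where the extra factor comes from, here is a minimal sketch of a Fenwick tree whose inverse query binary-searches over prefix sums. The type and method names (`Fenwick`, `prefix_sum`, `search`) are assumptions made for this example, not the code being replaced. Each of the O(log n) binary-search probes calls an O(log n) prefix sum.

```rust
/// Illustrative Fenwick (binary indexed) tree over u64 weights.
/// Slot 0 of `tree` is unused; weight `i` lives at position `i + 1`.
struct Fenwick {
    tree: Vec<u64>,
}

impl Fenwick {
    fn new(weights: &[u64]) -> Self {
        let mut fenwick = Fenwick { tree: vec![0; weights.len() + 1] };
        for (i, &w) in weights.iter().enumerate() {
            fenwick.add(i, w);
        }
        fenwick
    }

    /// Adds `delta` to the weight at `index` in O(log n).
    fn add(&mut self, index: usize, delta: u64) {
        let mut i = index + 1;
        while i < self.tree.len() {
            self.tree[i] += delta;
            i += i & i.wrapping_neg(); // step by the lowest set bit
        }
    }

    /// Sum of the first `count` weights in O(log n).
    fn prefix_sum(&self, count: usize) -> u64 {
        let (mut i, mut sum) = (count, 0);
        while i > 0 {
            sum += self.tree[i];
            i -= i & i.wrapping_neg();
        }
        sum
    }

    /// Inverse query: the index k with
    /// prefix_sum(k) <= target < prefix_sum(k + 1).
    /// Requires target < total weight.
    /// O(log n) probes, each costing O(log n).
    fn search(&self, target: u64) -> usize {
        let (mut lo, mut hi) = (0, self.tree.len() - 1);
        // Invariant: prefix_sum(lo) <= target < prefix_sum(hi).
        while hi - lo > 1 {
            let mid = lo + (hi - lo) / 2;
            if self.prefix_sum(mid) <= target {
                lo = mid;
            } else {
                hi = mid;
            }
        }
        lo
    }
}
```

With weights `[3, 0, 5, 2]`, targets 0..=2 map to index 0, targets 3..=7 to index 2, and targets 8..=9 to index 3. The zero-weight entry is never returned.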

Summary of Changes

This commit instead uses a binary tree where each node contains the sum
of all weights in its left sub-tree. The weights themselves are
implicitly stored at the leaves. Inverse queries and updates to the tree
can all be done in O(log n), resulting in an overall O(n log n) weighted
shuffle implementation.
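
The idea can be sketched as follows. This is a simplified illustration under stated assumptions, not the code in this PR: `LeftSumTree`, `adjust`, and `weighted_shuffle` are hypothetical names, the `sample` closure stands in for a random number generator, and zero-weight entries are simply left out of the output.

```rust
/// Implicit complete binary tree in heap layout: internal node i has
/// children 2i + 1 and 2i + 2, and leaf k sits at index left.len() + k.
struct LeftSumTree {
    /// left[i] = sum of all weights in the left sub-tree of internal node i.
    left: Vec<u64>,
    /// Sum of all weights still in the tree.
    total: u64,
}

impl LeftSumTree {
    fn new(weights: &[u64]) -> Self {
        // Pad the leaf count to a power of two; padding leaves weigh zero.
        let num_internal = weights.len().next_power_of_two() - 1;
        let mut tree = LeftSumTree { left: vec![0; num_internal], total: 0 };
        for (k, &w) in weights.iter().enumerate() {
            tree.total += w;
            tree.adjust(k, |sum| *sum += w);
        }
        tree
    }

    /// Walks from leaf k to the root in O(log n), applying `f` to every
    /// ancestor that has the leaf in its left sub-tree.
    fn adjust(&mut self, k: usize, f: impl Fn(&mut u64)) {
        let mut index = self.left.len() + k;
        while index != 0 {
            let is_left_child = index % 2 == 1;
            index = (index - 1) / 2;
            if is_left_child {
                f(&mut self.left[index]);
            }
        }
    }

    /// Removes `weight` at leaf k in O(log n).
    fn remove(&mut self, k: usize, weight: u64) {
        self.total -= weight;
        self.adjust(k, |sum| *sum -= weight);
    }

    /// Inverse query in O(log n): returns (leaf index, leaf weight) for
    /// val < total. The leaf weight is recovered during the descent, so
    /// it never has to be stored explicitly.
    fn search(&self, mut val: u64) -> (usize, u64) {
        let mut index = 0;
        let mut weight = self.total; // sum of weights under the current node
        while index < self.left.len() {
            let left = self.left[index];
            if val < left {
                weight = left;
                index = 2 * index + 1;
            } else {
                val -= left;
                weight -= left;
                index = 2 * index + 2;
            }
        }
        (index - self.left.len(), weight)
    }
}

/// Draws indices without replacement, proportionally to their weights.
/// `sample(bound)` is a stand-in for an RNG and must return a value in
/// 0..bound. Zero-weight entries are omitted in this sketch.
fn weighted_shuffle(weights: &[u64], mut sample: impl FnMut(u64) -> u64) -> Vec<usize> {
    let mut tree = LeftSumTree::new(weights);
    let mut order = Vec::with_capacity(weights.len());
    while tree.total > 0 {
        let (k, weight) = tree.search(sample(tree.total));
        tree.remove(k, weight);
        order.push(k);
    }
    order
}
```

Each of the n draws performs one descent and one leaf-to-root update, which gives the O(n log n) total. Taking the sampler as a closure keeps the sketch deterministic under test; for weights `[3, 0, 5, 2]`, a sampler that always returns 0 yields the order `[0, 2, 3]`.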

Based on benchmarks, this results in a 24% improvement in
WeightedShuffle::shuffle:

Fenwick tree:

test bench_weighted_shuffle_new     ... bench:      36,686 ns/iter (+/- 191)
test bench_weighted_shuffle_shuffle ... bench:     342,625 ns/iter (+/- 4,067)

Binary tree:

test bench_weighted_shuffle_new     ... bench:      59,131 ns/iter (+/- 362)
test bench_weighted_shuffle_shuffle ... bench:     260,194 ns/iter (+/- 11,195)

Though WeightedShuffle::new is now slower, it generally can be cached
and reused as in Turbine:
https://github.com/anza-xyz/agave/blob/b3fd87fe8/turbine/src/cluster_nodes.rs#L68

Additionally, the new code has better asymptotic performance. For
example, with 20_000 weights WeightedShuffle::shuffle is 31% faster:

Fenwick tree:

test bench_weighted_shuffle_new     ... bench:     255,071 ns/iter (+/- 9,591)
test bench_weighted_shuffle_shuffle ... bench:   2,466,058 ns/iter (+/- 9,873)

Binary tree:

test bench_weighted_shuffle_new     ... bench:     830,727 ns/iter (+/- 10,210)
test bench_weighted_shuffle_shuffle ... bench:   1,696,160 ns/iter (+/- 75,271)

This is an automatic backport of pull request #185 done by [Mergify](https://mergify.com).

(cherry picked from commit b6d2237)

codecov-commenter · 2024-03-26T07:02:26Z

Codecov Report

Attention: Patch coverage is 99.00000%, with 1 line in your changes missing coverage. Please review.

Project coverage is 81.6%. Comparing base (cc4f37e) to head (204b10b).

Additional details and impacted files

@@           Coverage Diff           @@
##            v1.18     #425   +/-   ##
=======================================
  Coverage    81.5%    81.6%           
=======================================
  Files         827      827           
  Lines      224706   224756   +50     
=======================================
+ Hits       183329   183401   +72     
+ Misses      41377    41355   -22

…a-xyz#185) (anza-xyz#425) implements weighted shuffle using binary tree (anza-xyz#185)

Co-authored-by: behzad nouri <behzadnouri@gmail.com>

mergify bot assigned behzadnouri Mar 26, 2024

behzadnouri requested a review from gregcusack March 26, 2024 07:06

behzadnouri added the automerge label Mar 26, 2024

gregcusack approved these changes Mar 26, 2024


mergify bot merged commit e356a44 into v1.18 Mar 26, 2024
36 checks passed

mergify bot deleted the mergify/bp/v1.18/pr-185 branch March 26, 2024 07:10

willhickey mentioned this pull request Mar 28, 2024

v1.18 commits - please ignore #475

Closed
