Speed up bitmap iteration #125

saik0 · 2021-12-28T15:53:57Z

A low-hanging fruit I found:

Bitmap iteration currently tests every bit of each non-zero element.

This PR skips runs of zeros by locating the least significant set bit, which can be computed with one instruction on most architectures.
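For readers who want to see the idea in isolation, here is a minimal sketch, not the code from this PR. It assumes a single `u64` word of the bitmap; the function names `bits_naive` and `bits_skipping` are made up for illustration. The skipping version jumps to the lowest set bit with `trailing_zeros` and then clears it with `word & (word - 1)`, which compilers can typically lower to a single instruction on targets that have one.

```rust
/// Baseline: test every one of the 64 bit positions in the word.
fn bits_naive(word: u64) -> Vec<u32> {
    (0..64).filter(|&i| word & (1u64 << i) != 0).collect()
}

/// Skip runs of zeros: jump straight to the lowest set bit, record it,
/// then clear it. The loop runs once per set bit, not once per position.
fn bits_skipping(mut word: u64) -> Vec<u32> {
    let mut out = Vec::with_capacity(word.count_ones() as usize);
    while word != 0 {
        // Index of the least significant set bit.
        out.push(word.trailing_zeros());
        // Clear the least significant set bit (word is non-zero, so no underflow).
        word &= word - 1;
    }
    out
}
```

Both functions return the set positions in ascending order, so the second can be checked against the first on any input.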

```
iter bitmap 1..10_000   time:   [48.602 us 48.787 us 48.978 us]
                        change: [-44.893% -44.519% -44.164%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

iter bitmap sparse      time:   [4.8238 us 4.8478 us 4.8737 us]
                        change: [-3.6741% -3.1564% -2.6669%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild

iter bitmap dense       time:   [160.14 us 160.73 us 161.29 us]
                        change: [-27.912% -27.207% -26.571%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

iter bitmap minimal     time:   [18.447 us 18.512 us 18.579 us]
                        change: [-2.8722% -2.1247% -1.3247%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) high mild
  5 (5.00%) high severe

iter bitmap full        time:   [312.97 us 314.10 us 315.35 us]
                        change: [-18.450% -17.937% -17.471%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe

Benchmarking iter parsed: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 7.8s, enable flat sampling, or reduce sample count to 50.
iter parsed             time:   [1.5311 ms 1.5371 ms 1.5436 ms]
                        change: [-22.496% -21.943% -21.450%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe
```

Kerollmops

Indeed, it looks better, in terms of performance and in code clarity.
Thank you!
bors merge

125: Speed up bitmap iteration r=Kerollmops a=saik0

Co-authored-by: saik0 <github@saik0.net>

bors · 2022-01-03T15:12:09Z

Build failed:

ci (stable)

Kerollmops · 2022-01-03T15:13:33Z

Hey @saik0,

Could you please run `cargo fmt`?

Kerollmops · 2022-01-04T12:20:55Z

Thank you!
bors merge

bors · 2022-01-04T12:24:37Z

Build failed:

ci (stable)

Kerollmops · 2022-01-04T12:48:04Z

Hum... It seems like you should use the latest version of cargo fmt, or even the nightly version.

saik0 · 2022-01-04T20:52:15Z

Ah, benches is a different crate. I had to run clippy and format on it as well.

Kerollmops · 2022-01-04T21:09:20Z

Thank you :)
Bors merge

bors · 2022-01-04T21:13:17Z

Build failed:

ci (stable)

Kerollmops · 2022-01-05T08:25:01Z

tests/iter.rs

```diff
+fn qc_iter(values: Vec<u32>) {
+    let expected = {
+        let mut vec = values.clone();
+        vec.sort();
```

Suggested change

```diff
-        vec.sort();
+        vec.sort_unstable();
```

Clippy is not happy about using `sort` instead of `sort_unstable`.

saik0 · 2022-01-06

Oh. `BTreeSet` implements `Arbitrary` and is sorted and unique. `Iterator` has an `eq` method that does an element-wise comparison.
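As a rough illustration of that point, here is a standard-library-only sketch. It is not the quickcheck test from this PR, and the helper names `sorted_unique` and `matches_btreeset` are hypothetical. It shows that iterating a `BTreeSet` yields the same sequence as the manual sort-and-dedup approach, with `Iterator::eq` doing the element-wise comparison.

```rust
use std::collections::BTreeSet;

/// The manual way to build the expected sequence: sort, then drop duplicates.
fn sorted_unique(values: &[u32]) -> Vec<u32> {
    let mut vec = values.to_vec();
    vec.sort_unstable();
    vec.dedup();
    vec
}

/// The same expectation obtained from a BTreeSet, which iterates in
/// ascending order and never stores duplicates. `Iterator::eq` compares
/// the two sequences element by element, including their lengths.
fn matches_btreeset(values: &[u32]) -> bool {
    let set: BTreeSet<u32> = values.iter().copied().collect();
    set.iter().copied().eq(sorted_unique(values))
}
```

In a property test, taking the set directly as the generated input would remove the need for the sort and dedup step altogether.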

Kerollmops · 2022-01-06T09:56:07Z

Thank you for the changes, it indeed makes the code clearer!
bors merge

bors · 2022-01-06T10:00:11Z

Build succeeded:

ci (stable)

127: Add scalar optimizations from CRoaring / arXiv:1709.07821 section 3 r=Kerollmops a=saik0

### Purpose

This PR adds some optimizations from CRoaring as outlined in arXiv:1709.07821 section 3

### Overview

* All inserts and removes are now branchless (not in arXiv:1709.07821, but in CRoaring)
* Section 3.1 was already implemented, except for `BitmapIter`. This is covered in #125
* Implement Array-Bitset aggregates as outlined in section 3.2
  * Also branchless 😎
* Tracks bitmap cardinality while performing bitmap-bitmap ops
  * This is a deviation from CRoaring, and will need to be benchmarked further before this Draft PR is ready
  * Curious to hear what you think about this `@lemire`
* In order to track bitmap cardinality the len field had to be moved into `Store::Bitmap`
  * This is unfortunately a cross-cutting change
* `Store` was quite large (LoC) and had many responsibilities. The largest change in this draft is decomposing `Store` so that its variants wrap two new container types, each responsible for maintaining its invariants and implementing `ops`
  * `Bitmap8K` keeps track of its cardinality
  * `SortedU16Vec` maintains its sorting
  * `Store` now only delegates to these containers
  * My hope is that this will be useful when implementing run containers. 🤞
* Unfortunately so much code was moved that this PR is _HUGE_

### Out of scope

* Inline ASM for Array-Bitset aggregates
* Section 4 (explicit SIMD). As noted by the paper authors: The compiler does a decent job of autovectorization, though not as good as hand-tuned

### Notes

* I attempted to emulate the inline ASM Array-Bitset aggregates by using a mix of unsafe ptr arithmetic and x86-64 intrinsics, hoping to compile to the same instructions. I was unable to get it under 13 instructions per iteration (compared to the paper's 5). While it was an improvement, I abandoned the effort in favor of waiting for the `asm!` macro to stabilize. rust-lang/rust#72016

Co-authored-by: saik0 <github@saik0.net>
Co-authored-by: Joel Pedraza <github@saik0.net>
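To make "branchless inserts and removes" with cardinality tracking more concrete, here is one possible shape, written as an assumption-laden sketch and not taken from #127. The `Bitmap` struct, its `len` and `bits` fields, and the method bodies are all hypothetical. The point is that the cardinality delta is derived arithmetically from the old and new word, so there is no data-dependent branch on whether the bit was already set.

```rust
/// Hypothetical stand-in for a fixed 65,536-bit container that tracks
/// its own cardinality (names and layout are assumptions for this sketch).
struct Bitmap {
    len: u64,
    bits: Box<[u64; 1024]>,
}

impl Bitmap {
    fn new() -> Self {
        Bitmap { len: 0, bits: Box::new([0; 1024]) }
    }

    /// Sets the bit and returns true if it was previously unset.
    fn insert(&mut self, index: u16) -> bool {
        let key = usize::from(index) / 64;
        let bit = u32::from(index) % 64;
        let old = self.bits[key];
        let new = old | (1u64 << bit);
        // 1 if the word changed at `bit`, 0 otherwise: no `if` needed.
        let inserted = (old ^ new) >> bit;
        self.bits[key] = new;
        self.len += inserted;
        inserted != 0
    }

    /// Clears the bit and returns true if it was previously set.
    fn remove(&mut self, index: u16) -> bool {
        let key = usize::from(index) / 64;
        let bit = u32::from(index) % 64;
        let old = self.bits[key];
        let new = old & !(1u64 << bit);
        // 1 if the word changed at `bit`, 0 otherwise.
        let removed = (old ^ new) >> bit;
        self.bits[key] = new;
        self.len -= removed;
        removed != 0
    }
}
```

Because only the targeted bit can differ between `old` and `new`, shifting the XOR right by `bit` yields exactly 0 or 1, which is added to or subtracted from the running count unconditionally.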

saik0 added 4 commits December 28, 2021 06:14

speed up iteration

960dd49

add iter quickcheck

375d534

fix formatting

d02de84

add benchmarks

478361a

Kerollmops approved these changes Jan 3, 2022

saik0 added 3 commits January 4, 2022 01:54

fix formatting

1185654

remove .idea dir

278c2b8

remove unused import

8730e0b

saik0 mentioned this pull request Jan 4, 2022

Add scalar optimizations from CRoaring / arXiv:1709.07821 section 3 #127

Merged

fix formatting and ignore warning in benchmark

837c787

Kerollmops requested changes Jan 5, 2022

saik0 added 2 commits January 5, 2022 16:22

replace vec sort / dedup with BTreeSet in iter qc test

bfb651a

use blsr equivalent instruction in iter

6d98cf8

bors bot merged commit 4f9a119 into RoaringBitmap:master Jan 6, 2022

This was referenced Feb 5, 2022

Perf regression: Iteration #176

Closed

Constant time len #182

Open
