Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed up bitmap iteration #125

Merged
merged 10 commits into from
Jan 6, 2022
Merged

Speed up bitmap iteration #125

merged 10 commits into from
Jan 6, 2022

Conversation

saik0
Copy link
Contributor

@saik0 saik0 commented Dec 28, 2021

A low hanging fruit I found:

Bitmap iteration currently tests every bit for each non-zero element

This PR skips runs of zeros by inspecting the least significant bit, which can be computed with one instruction on most architectures.

iter bitmap 1..10_000   time:   [48.602 us 48.787 us 48.978 us]                                   
                        change: [-44.893% -44.519% -44.164%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

iter bitmap sparse      time:   [4.8238 us 4.8478 us 4.8737 us]                                
                        change: [-3.6741% -3.1564% -2.6669%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild

iter bitmap dense       time:   [160.14 us 160.73 us 161.29 us]                              
                        change: [-27.912% -27.207% -26.571%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

iter bitmap minimal     time:   [18.447 us 18.512 us 18.579 us]                                 
                        change: [-2.8722% -2.1247% -1.3247%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) high mild
  5 (5.00%) high severe

iter bitmap full        time:   [312.97 us 314.10 us 315.35 us]                             
                        change: [-18.450% -17.937% -17.471%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe

Benchmarking iter parsed: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 7.8s, enable flat sampling, or reduce sample count to 50.
iter parsed             time:   [1.5311 ms 1.5371 ms 1.5436 ms]                         
                        change: [-22.496% -21.943% -21.450%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe

Copy link
Member

@Kerollmops Kerollmops left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, it looks better, in terms of performance and in code clarity.
Thank you!
bors merge

bors bot added a commit that referenced this pull request Jan 3, 2022
125: Speed up bitmap iteration r=Kerollmops a=saik0

A low hanging fruit I found:

Bitmap iteration currently tests every bit for each non-zero element

This PR skips runs of zeros by inspecting the least significant bit, which can be computed with one instruction on most architectures.

```
iter bitmap 1..10_000   time:   [48.602 us 48.787 us 48.978 us]                                   
                        change: [-44.893% -44.519% -44.164%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

iter bitmap sparse      time:   [4.8238 us 4.8478 us 4.8737 us]                                
                        change: [-3.6741% -3.1564% -2.6669%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild

iter bitmap dense       time:   [160.14 us 160.73 us 161.29 us]                              
                        change: [-27.912% -27.207% -26.571%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

iter bitmap minimal     time:   [18.447 us 18.512 us 18.579 us]                                 
                        change: [-2.8722% -2.1247% -1.3247%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) high mild
  5 (5.00%) high severe

iter bitmap full        time:   [312.97 us 314.10 us 315.35 us]                             
                        change: [-18.450% -17.937% -17.471%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe

Benchmarking iter parsed: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 7.8s, enable flat sampling, or reduce sample count to 50.
iter parsed             time:   [1.5311 ms 1.5371 ms 1.5436 ms]                         
                        change: [-22.496% -21.943% -21.450%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe
```

Co-authored-by: saik0 <github@saik0.net>
@bors
Copy link
Contributor

bors bot commented Jan 3, 2022

Build failed:

@Kerollmops
Copy link
Member

Hey @saik0,

Could you please do a little cargo fmt please?

@Kerollmops
Copy link
Member

Thank you!
bors merge

bors bot added a commit that referenced this pull request Jan 4, 2022
125: Speed up bitmap iteration r=Kerollmops a=saik0

A low hanging fruit I found:

Bitmap iteration currently tests every bit for each non-zero element

This PR skips runs of zeros by inspecting the least significant bit, which can be computed with one instruction on most architectures.

```
iter bitmap 1..10_000   time:   [48.602 us 48.787 us 48.978 us]                                   
                        change: [-44.893% -44.519% -44.164%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

iter bitmap sparse      time:   [4.8238 us 4.8478 us 4.8737 us]                                
                        change: [-3.6741% -3.1564% -2.6669%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild

iter bitmap dense       time:   [160.14 us 160.73 us 161.29 us]                              
                        change: [-27.912% -27.207% -26.571%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

iter bitmap minimal     time:   [18.447 us 18.512 us 18.579 us]                                 
                        change: [-2.8722% -2.1247% -1.3247%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) high mild
  5 (5.00%) high severe

iter bitmap full        time:   [312.97 us 314.10 us 315.35 us]                             
                        change: [-18.450% -17.937% -17.471%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe

Benchmarking iter parsed: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 7.8s, enable flat sampling, or reduce sample count to 50.
iter parsed             time:   [1.5311 ms 1.5371 ms 1.5436 ms]                         
                        change: [-22.496% -21.943% -21.450%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe
```

Co-authored-by: saik0 <github@saik0.net>
@bors
Copy link
Contributor

bors bot commented Jan 4, 2022

Build failed:

@Kerollmops
Copy link
Member

Kerollmops commented Jan 4, 2022

Hum... It seems like you should use the latest version of cargo fmt, or even the nightly version.

@saik0
Copy link
Contributor Author

saik0 commented Jan 4, 2022

Ah, benches is a different crate. I had to run clippy and format on it as well.

@Kerollmops
Copy link
Member

Thank you :)
Bors merge

bors bot added a commit that referenced this pull request Jan 4, 2022
125: Speed up bitmap iteration r=Kerollmops a=saik0

A low hanging fruit I found:

Bitmap iteration currently tests every bit for each non-zero element

This PR skips runs of zeros by inspecting the least significant bit, which can be computed with one instruction on most architectures.

```
iter bitmap 1..10_000   time:   [48.602 us 48.787 us 48.978 us]                                   
                        change: [-44.893% -44.519% -44.164%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

iter bitmap sparse      time:   [4.8238 us 4.8478 us 4.8737 us]                                
                        change: [-3.6741% -3.1564% -2.6669%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild

iter bitmap dense       time:   [160.14 us 160.73 us 161.29 us]                              
                        change: [-27.912% -27.207% -26.571%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

iter bitmap minimal     time:   [18.447 us 18.512 us 18.579 us]                                 
                        change: [-2.8722% -2.1247% -1.3247%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) high mild
  5 (5.00%) high severe

iter bitmap full        time:   [312.97 us 314.10 us 315.35 us]                             
                        change: [-18.450% -17.937% -17.471%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe

Benchmarking iter parsed: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 7.8s, enable flat sampling, or reduce sample count to 50.
iter parsed             time:   [1.5311 ms 1.5371 ms 1.5436 ms]                         
                        change: [-22.496% -21.943% -21.450%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe
```

Co-authored-by: saik0 <github@saik0.net>
@bors
Copy link
Contributor

bors bot commented Jan 4, 2022

Build failed:

tests/iter.rs Outdated
fn qc_iter(values: Vec<u32>) {
let expected = {
let mut vec = values.clone();
vec.sort();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
vec.sort();
vec.sort_unstable();

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clippy is not happy about using sort instead of sort_unstable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh. BTreeSet impls Arbitrary and is sorted and unique. Iterator has an eq method that does an element-wise comparison.

@Kerollmops
Copy link
Member

Thank you for the changes, it indeed makes the code more clear!
bors merge

@bors
Copy link
Contributor

bors bot commented Jan 6, 2022

Build succeeded:

@bors bors bot merged commit 4f9a119 into RoaringBitmap:master Jan 6, 2022
bors bot added a commit that referenced this pull request Jan 12, 2022
127: Add scalar optimizations from CRoaring / arXiv:1709.07821 section 3 r=Kerollmops a=saik0

### Purpose

This PR adds some optimizations from CRoaring as outlined in  arXiv:1709.07821 section 3

### Overview 
 * All inserts and removes are now branchless (!in arXiv:1709.0782, in CRoaring)
 * Section 3.1 was already implemented, except for `BitmapIter`. This is covered in #125
 * Implement Array-Bitset aggregates as outlined in section 3.2
   * Also branchless 😎
 * Tracks bitmap cardinality while performing bitmap-bitmap ops
   * This is a deviation from CRoaring, and will need to be benchmarked further before this Draft PR is ready
   * Curious to hear what you think about this `@lemire` 
 * In order to track bitmap cardinality the len field had to moved into `Store::Bitmap`
   * This is unfortunately a cross cutting change
 * `Store` was quite large (LoC) and had many responsibilities. The largest change in this draft is decomposing `Store` such hat it's field variants are two new container types: each responsible for maintaining their invariants and implementing `ops`
   * `Bitmap8K` keeps track of it's cardinality
   * `SortedU16Vec` maintains its sorting
   * `Store` now only delegates to these containers
   * My hope is that this will be useful when implementing run containers. 🤞
   * Unfortunately so much code was  moved this PR is _HUGE_


### Out of scope
 * Inline ASM for Array-Bitset aggregates
 * Section 4 (explicit SIMD). As noted by the paper authors: The compiler does a decent job of autovectorization, though not as good as hand-tuned

### Notes
 * I attempted to emulate the inline ASM Array-Bitset aggregates by using a mix of unsafe ptr arithmetic and x86-64 intrinsics, hoping to compile to the same instructions. I was unable to get it under 13 instructions per iteration (compared to the papers 5). While it was an improvement, I abandoned the effort in favor of waiting for the `asm!` macro to stabilize. rust-lang/rust#72016

Co-authored-by: saik0 <github@saik0.net>
Co-authored-by: Joel Pedraza <github@saik0.net>
This was referenced Feb 5, 2022
not-jan pushed a commit to not-jan/roaring-rs that referenced this pull request Aug 31, 2022
125: Speed up bitmap iteration r=Kerollmops a=saik0

A low hanging fruit I found:

Bitmap iteration currently tests every bit for each non-zero element

This PR skips runs of zeros by inspecting the least significant bit, which can be computed with one instruction on most architectures.

```
iter bitmap 1..10_000   time:   [48.602 us 48.787 us 48.978 us]                                   
                        change: [-44.893% -44.519% -44.164%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

iter bitmap sparse      time:   [4.8238 us 4.8478 us 4.8737 us]                                
                        change: [-3.6741% -3.1564% -2.6669%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild

iter bitmap dense       time:   [160.14 us 160.73 us 161.29 us]                              
                        change: [-27.912% -27.207% -26.571%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

iter bitmap minimal     time:   [18.447 us 18.512 us 18.579 us]                                 
                        change: [-2.8722% -2.1247% -1.3247%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) high mild
  5 (5.00%) high severe

iter bitmap full        time:   [312.97 us 314.10 us 315.35 us]                             
                        change: [-18.450% -17.937% -17.471%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe

Benchmarking iter parsed: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 7.8s, enable flat sampling, or reduce sample count to 50.
iter parsed             time:   [1.5311 ms 1.5371 ms 1.5436 ms]                         
                        change: [-22.496% -21.943% -21.450%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe
```

Co-authored-by: saik0 <github@saik0.net>
not-jan pushed a commit to not-jan/roaring-rs that referenced this pull request Aug 31, 2022
127: Add scalar optimizations from CRoaring / arXiv:1709.07821 section 3 r=Kerollmops a=saik0

### Purpose

This PR adds some optimizations from CRoaring as outlined in  arXiv:1709.07821 section 3

### Overview 
 * All inserts and removes are now branchless (!in arXiv:1709.0782, in CRoaring)
 * Section 3.1 was already implemented, except for `BitmapIter`. This is covered in RoaringBitmap#125
 * Implement Array-Bitset aggregates as outlined in section 3.2
   * Also branchless 😎
 * Tracks bitmap cardinality while performing bitmap-bitmap ops
   * This is a deviation from CRoaring, and will need to be benchmarked further before this Draft PR is ready
   * Curious to hear what you think about this `@lemire` 
 * In order to track bitmap cardinality the len field had to moved into `Store::Bitmap`
   * This is unfortunately a cross cutting change
 * `Store` was quite large (LoC) and had many responsibilities. The largest change in this draft is decomposing `Store` such hat it's field variants are two new container types: each responsible for maintaining their invariants and implementing `ops`
   * `Bitmap8K` keeps track of it's cardinality
   * `SortedU16Vec` maintains its sorting
   * `Store` now only delegates to these containers
   * My hope is that this will be useful when implementing run containers. 🤞
   * Unfortunately so much code was  moved this PR is _HUGE_


### Out of scope
 * Inline ASM for Array-Bitset aggregates
 * Section 4 (explicit SIMD). As noted by the paper authors: The compiler does a decent job of autovectorization, though not as good as hand-tuned

### Notes
 * I attempted to emulate the inline ASM Array-Bitset aggregates by using a mix of unsafe ptr arithmetic and x86-64 intrinsics, hoping to compile to the same instructions. I was unable to get it under 13 instructions per iteration (compared to the papers 5). While it was an improvement, I abandoned the effort in favor of waiting for the `asm!` macro to stabilize. rust-lang/rust#72016

Co-authored-by: saik0 <github@saik0.net>
Co-authored-by: Joel Pedraza <github@saik0.net>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants