Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce a test that highlight a serialization bug #61

Merged
merged 1 commit into from
Sep 10, 2020

Conversation

Kerollmops
Copy link
Member

@Kerollmops Kerollmops commented Sep 9, 2020

Hello,

I found a bug while using roaring in the next version of the MeiliSearch engine, it appears that the serialize function doesn't correctly represent the integers in the serialized format.

I haven't had time to dive into the serialise function yet.
Could you take a look please?

Note that the original integers array was 77MB as an array text, I found a way to reduce that to 4096 values but there surely is a smaller input than this.

Thank you for your help :)

@lemire
Copy link
Member

lemire commented Sep 9, 2020

Could you post what gets deserialized? It might be possible to identify the bug just by looking at it.

@Kerollmops
Copy link
Member Author

Kerollmops commented Sep 9, 2020

I can give you the debug display if you want:

 left: `RoaringBitmap<4097 values between 6619162 and 108658947>`,
right: `RoaringBitmap<4097 values between 6619137 and 108658947>`', tests/serialization.rs:96:5

Do you want more?

@lemire
Copy link
Member

lemire commented Sep 9, 2020

That's helpful because it points at an issue in the deserialization of the bitmap containers.

@Kerollmops
Copy link
Member Author

What makes you think that? Why can't it be a serialization bug?

@lemire
Copy link
Member

lemire commented Sep 9, 2020

What makes you think that? Why can't it be a serialization bug?

Well, whether it is serialization or deserialization is impossible to tell without further work but what the numbers you got suggest is that it has to do with bitmap containers.

@lemire
Copy link
Member

lemire commented Sep 9, 2020

Yep. It is deserialization bug... see my fix at...

#62

So it is a tiny thing. By definition of the Roaring Bitmap format, if you have 4096 elements or less, you have an array container. But roaring-rs, at the deserialization level, had a "<" instead of a "<=" so it is an "off by one" error. My patch fixes that.

(There is a formal specification where this is spelled out.)

@lemire
Copy link
Member

lemire commented Sep 9, 2020

So why did I expected a deserialization bug? Because I expected the data structure to be fine (since otherwise you'd get in trouble quickly) and because the serialization/deserialization test have lots of checks during the serialization (e.g., the code tests that the right number of bytes were written) but relatively fewer tests during the deserialization. Of course, it could still have been at the serialization...

@lemire
Copy link
Member

lemire commented Sep 9, 2020

(For clarification, I did not write a single line of code in roaring-rs. I just happen to know the Roaring format well)

@Kerollmops
Copy link
Member Author

Wow!

Thank you very much for this quick fix.
I don't think I could have found it as fast as you.

@lemire
Copy link
Member

lemire commented Sep 9, 2020

@Kerollmops Quick! We want to know how well this library is serving you!

bors bot added a commit that referenced this pull request Sep 10, 2020
62: Bug fix for serialization bug r=Nemo157 a=lemire

See #61

Co-authored-by: Kerollmops <clement@meilisearch.com>
Co-authored-by: Daniel Lemire <lemire@gmail.com>
@bors bors bot merged commit 6c1b644 into RoaringBitmap:master Sep 10, 2020
@Kerollmops Kerollmops deleted the de-serialize-bug branch September 10, 2020 11:06
@Kerollmops
Copy link
Member Author

Hey @lemire,

Wasn't sure if you asked me to give some feedback, so I ask friends!

So yeah, I am pretty happy with the library so far, it is fast and easy to use, I have just some little remarks about the library:

The serialized format is too heavy for little sets (less than 100 ints), I achived to gain 26% on memory when I serialized by hand the few integers I had (like a simple slice), the headers are really heavy. The payload was simply smaller than the headers.

I also needed a function to do an operation on multiple bitmaps at the same time, I know that there exist a multi-intersection on the original C library, but didn't check for a multi-is-disjoint function. I also tried to implement the multi-union but performances weren't there, not sure why.

Thank you both for this library :)

@lemire
Copy link
Member

lemire commented Sep 10, 2020

The serialized format is too heavy for little sets (less than 100 ints), I achived to gain 26% on memory when I serialized by hand the few integers I had (like a simple slice), the headers are really heavy. The payload was simply smaller than the headers.

Right. The Roaring Bitmap format tries to account for many different goals at once, and producing a concise output is only one of them. Furthermore, I think that roaring-rs does not support run containers, which is often a good way to compress the data.

I also needed a function to do an operation on multiple bitmaps at the same time, I know that there exist a multi-intersection on the original C library, but didn't check for a multi-is-disjoint function. I also tried to implement the multi-union but performances weren't there, not sure why.

Given that I only have a passing familiarity with roaring-rs, I can only comment in broad terms.

  1. Assuming that it is not already in the library, writing a "is disjoint?" function (as opposed to computing the intersection and checking that it has a size of zero) is easy and can be beneficial. If it does not already exist, writing it ought to be very easy.

  2. Doing a multi-bitmap intersection via a dedicated function is probably not so beneficial. Intersections over many bitmaps can be done efficiently using a naive two-by-two intersection, most of the time. If you have few bitmaps, start from the smallest one.

  3. For the multi-union, the best approach in practice is again, often, a two-by-two union, but using the concept of a "lazy" union. The idea is to do the union two bitmaps by two bitmaps, but to produce an intermediate result which is not, strictly speaking, a valid roaring bitmap. Then when you are done with the whole computation, you can scan the result and make it a valid bitmap again. You can also use a priority queue, which can be better in some instances, but it is both more complicated, and more memory intensive.

There are lots of little performance tricks, and you will find them in the Go, Java and C versions of Roaring. I do not know how many of these tricks roaring-rs captured. There may be quite a few missing optimization opportunities left. Given that these techniques have been documented, implemented and tested out, it should be easy to add them to Rust.

(Note that I am not an author of roaring-rs.)

@lemire
Copy link
Member

lemire commented Sep 10, 2020

For a pretty cool "fast way" to do multi-intersection, please see this code...
https://github.com/RoaringBitmap/RoaringBitmap/blob/master/RoaringBitmap/src/main/java/org/roaringbitmap/FastAggregation.java#L293

@Kerollmops
Copy link
Member Author

Kerollmops commented Sep 10, 2020

Furthermore, I think that roaring-rs does not support run containers, which is often a good way to compress the data.

What the difference between bitset and run containers? I can maybe find that in the search paper that I already read... Is it only a way to compress the data a little more or does it also give performances improvements?

  1. Assuming that it is not already in the library, writing a "is disjoint?" function (as opposed to computing the intersection and checking that it has a size of zero) is easy and can be beneficial. If it does not already exist, writing it ought to be very easy.

The roaring-rs library already has the is_disjoint method but doesn't support any multi-bitmaps functions. I was pretty sure that using an is_disjoint for multiple bitmaps would be much faster than computing the intermediate intersections and checking if the final intersection is empty. Not sure if it is easy to implement or even possible.

  1. Doing a multi-bitmap intersection via a dedicated function is probably not so beneficial. Intersections over many bitmaps can be done efficiently using a naive two-by-two intersection, most of the time. If you have few bitmaps, start from the smallest one.

Yeah was already doing that, sorting them by size, this is much faster indeed. I also find a way to stop retrieving bitmaps from the kv-store when the latest intermediate intersection produced an empty bitmap. But this can't be integrated into roaring.

  1. For the multi-union, the best approach in practice is again, often, a two-by-two union, but using the concept of a "lazy" union. The idea is to do the union two bitmaps by two bitmaps, but to produce an intermediate result which is not, strictly speaking, a valid roaring bitmap. Then when you are done with the whole computation, you can scan the result and make it a valid bitmap again. You can also use a priority queue, which can be better in some instances, but it is both more complicated, and more memory intensive.

Yeah I read that in the C implementation if I remember well, only computing the final container type (bitset or array) at the end of the process, to avoid the type switch overhead. Unfortunatelly I wasn't able to make it fast enough in my PR, not sure why.

EDIT: Wow, didn't see your message on the PR.

Thank you for your help again.

@lemire
Copy link
Member

lemire commented Sep 10, 2020

What the difference between bitset and run containers? I can maybe find that in the search paper that I already read... Is it only a way to compress the data a little more or does it also give performances improvements?

It is quite different. There is already a PR in this repository for run containers, but it has not been merged.

@lemire
Copy link
Member

lemire commented Sep 10, 2020

I was pretty sure that using an is_disjoint for multiple bitmaps would be much faster than computing the intermediate intersections and checking if the final intersection is empty. Not sure if it is easy to implement or even possible.

This seems a bit speculative. It is best answered with a realistic benchmark and hard data.

@Kerollmops
Copy link
Member Author

Kerollmops commented Sep 10, 2020

I think I understand what could be a run container: 0..23 followed by 563..4000.
Only storing the start and end of runs!

@lemire
Copy link
Member

lemire commented Sep 10, 2020

But this can't be integrated into roaring.

I don't think it needs to.

@lemire
Copy link
Member

lemire commented Sep 10, 2020

Only storing the start and end of runs!

Correct.

@lemire
Copy link
Member

lemire commented Sep 10, 2020

Yeah I read that in the C implementation if I remember well, only computing the final container type (bitset or array) at the end of the process, to avoid the type switch overhead.

I would not put it quite like that.

We may actually trigger conversions that would be unneeded, deliberately.

A bitmap container requires 1024 words. Each time you do a union, you need something like 1024 operations. Less if you have (auto)vectorization. If you can count on 128-bit SIMD registers, it is something like 512 operations. This sounds a lot but it is entirely free of branches. Now, each mispredicted branch is ~15 cycles lost. So if you are doing a lot arrays-to-arrays union, you can quickly end up spending a lot more than 1024 cycles doing a single union... because it is filled with "hard to predict branches".

Note that if your benchmark is poorly constructed (synthetic data, small data), you won't see these effects. Use large and unpredictable data for sanity.

So the actual trick is to eagerly represent your data using bitmap containers (bitset) because then you get a really simple computational model with very straight-forward performance.

Of course, it is not black and white. So one needs to benchmark and tweak the heuristics.

not-jan pushed a commit to not-jan/roaring-rs that referenced this pull request Aug 31, 2022
62: Bug fix for serialization bug r=Nemo157 a=lemire

See RoaringBitmap#61

Co-authored-by: Kerollmops <clement@meilisearch.com>
Co-authored-by: Daniel Lemire <lemire@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants