-
Notifications
You must be signed in to change notification settings - Fork 84
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Introduce a test that highlight a serialization bug #61
Conversation
Could you post what gets deserialized? It might be possible to identify the bug just by looking at it. |
I can give you the debug display if you want:
Do you want more? |
That's helpful because it points at an issue in the deserialization of the bitmap containers. |
What makes you think that? Why can't it be a serialization bug? |
Well, whether it is serialization or deserialization is impossible to tell without further work but what the numbers you got suggest is that it has to do with bitmap containers. |
Yep. It is deserialization bug... see my fix at... So it is a tiny thing. By definition of the Roaring Bitmap format, if you have 4096 elements or less, you have an array container. But roaring-rs, at the deserialization level, had a "<" instead of a "<=" so it is an "off by one" error. My patch fixes that. (There is a formal specification where this is spelled out.) |
So why did I expected a deserialization bug? Because I expected the data structure to be fine (since otherwise you'd get in trouble quickly) and because the serialization/deserialization test have lots of checks during the serialization (e.g., the code tests that the right number of bytes were written) but relatively fewer tests during the deserialization. Of course, it could still have been at the serialization... |
(For clarification, I did not write a single line of code in roaring-rs. I just happen to know the Roaring format well) |
Wow! Thank you very much for this quick fix. |
@Kerollmops Quick! We want to know how well this library is serving you! |
Hey @lemire, Wasn't sure if you asked me to give some feedback, so I ask friends! So yeah, I am pretty happy with the library so far, it is fast and easy to use, I have just some little remarks about the library: The serialized format is too heavy for little sets (less than 100 ints), I achived to gain 26% on memory when I serialized by hand the few integers I had (like a simple slice), the headers are really heavy. The payload was simply smaller than the headers. I also needed a function to do an operation on multiple bitmaps at the same time, I know that there exist a multi-intersection on the original C library, but didn't check for a multi-is-disjoint function. I also tried to implement the multi-union but performances weren't there, not sure why. Thank you both for this library :) |
Right. The Roaring Bitmap format tries to account for many different goals at once, and producing a concise output is only one of them. Furthermore, I think that roaring-rs does not support run containers, which is often a good way to compress the data.
Given that I only have a passing familiarity with roaring-rs, I can only comment in broad terms.
There are lots of little performance tricks, and you will find them in the Go, Java and C versions of Roaring. I do not know how many of these tricks roaring-rs captured. There may be quite a few missing optimization opportunities left. Given that these techniques have been documented, implemented and tested out, it should be easy to add them to Rust. (Note that I am not an author of roaring-rs.) |
For a pretty cool "fast way" to do multi-intersection, please see this code... |
What the difference between bitset and run containers? I can maybe find that in the search paper that I already read... Is it only a way to compress the data a little more or does it also give performances improvements?
The roaring-rs library already has the
Yeah was already doing that, sorting them by size, this is much faster indeed. I also find a way to stop retrieving bitmaps from the kv-store when the latest intermediate intersection produced an empty bitmap. But this can't be integrated into roaring.
Yeah I read that in the C implementation if I remember well, only computing the final container type (bitset or array) at the end of the process, to avoid the type switch overhead. Unfortunatelly I wasn't able to make it fast enough in my PR, not sure why. EDIT: Wow, didn't see your message on the PR. Thank you for your help again. |
It is quite different. There is already a PR in this repository for run containers, but it has not been merged. |
This seems a bit speculative. It is best answered with a realistic benchmark and hard data. |
I think I understand what could be a run container: 0..23 followed by 563..4000. |
I don't think it needs to. |
Correct. |
I would not put it quite like that. We may actually trigger conversions that would be unneeded, deliberately. A bitmap container requires 1024 words. Each time you do a union, you need something like 1024 operations. Less if you have (auto)vectorization. If you can count on 128-bit SIMD registers, it is something like 512 operations. This sounds a lot but it is entirely free of branches. Now, each mispredicted branch is ~15 cycles lost. So if you are doing a lot arrays-to-arrays union, you can quickly end up spending a lot more than 1024 cycles doing a single union... because it is filled with "hard to predict branches". Note that if your benchmark is poorly constructed (synthetic data, small data), you won't see these effects. Use large and unpredictable data for sanity. So the actual trick is to eagerly represent your data using bitmap containers (bitset) because then you get a really simple computational model with very straight-forward performance. Of course, it is not black and white. So one needs to benchmark and tweak the heuristics. |
62: Bug fix for serialization bug r=Nemo157 a=lemire See RoaringBitmap#61 Co-authored-by: Kerollmops <clement@meilisearch.com> Co-authored-by: Daniel Lemire <lemire@gmail.com>
Hello,
I found a bug while using roaring in the next version of the MeiliSearch engine, it appears that the serialize function doesn't correctly represent the integers in the serialized format.
I haven't had time to dive into the serialise function yet.
Could you take a look please?
Note that the original integers array was 77MB as an array text, I found a way to reduce that to 4096 values but there surely is a smaller input than this.
Thank you for your help :)