-
-
Notifications
You must be signed in to change notification settings - Fork 91
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add optimized implementation for string indexes #632
Conversation
I haven't had the time yet, but it would be great to have some numbers on the performance delta. Ideally for both space and time. |
There's a fundamental issue I overlooked in the design: the index does not support inequality lookups. It can't because it hashes the input. We'll have to reconsider the usefulness of a space-efficient equality-only string index before proceeding. |
ce68bfe
to
88b7f60
Compare
05fdebe
to
887eecb
Compare
This index is meant for opaque identifiers that are can only be retrieved via equality lookups. If a string type contains the #id attribute in a schema, the index factory chooses a new hash-based implementation. The hash index is a vector of digests, where each digest is trimmed to a specific byte length. This works only if the hash function distributes its value uniformly over the entire output bits. Whether our current choice of hash function (xxhash64) is an open question that needs to figured out.
The value index keeps two extra bitmaps independent of its implementation: a null bitmap and a mask bitmap. The null bitmap is for all nil values and the mask value used to be for all IDs that exist in the index *and* the nil values. This change makes the two bitmaps disjoint. The result is not only less bits overall, but also less bit-wise operations when performing a lookup, because the null bitmap no longer needs to be NANDed out.
a0dcc1c
to
3ca898e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review of the first 3 commits only, so I can do this in chunks.
Looking good so far.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've done a first pass. Conceptually, this seems all good to me, and I didn't notice anything weird in a short test run. Just a few small notes here and there for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed my review strategy to go file by file.
The commit ede727e seems to have triggered an ADL correctness issue, causing the wrong operator to be called and thus the wrong type name to be displayed. That should likely be fixed in a separate PR. |
This only popped up because of an overload in the Zeek type printer that didn't get triggered because it lacked a const qualifier.
@dominiklohmann Sorry, I fixed the issue already before seeing your request to do this separately. |
The new index is meant for opaque identifiers that can only be retrieved via equality lookups. If a string type contains the
#index=hash
attribute, the index factory chooses this new hash-based implementation.