Add optimized implementation for string indexes #632

mavam · 2019-11-01T22:05:51Z

The new index is meant for opaque identifiers that can only be retrieved via equality lookups. If a string type contains the #index=hash attribute, the index factory chooses this new hash-based implementation.

mavam · 2019-11-01T22:19:58Z

I haven't had the time yet, but it would be great to have some numbers on the performance delta. Ideally for both space and time.

mavam · 2019-11-02T12:26:48Z

There's a fundamental issue I overlooked in the design: the index does not support inequality lookups. It can't because it hashes the input. We'll have to reconsider the usefulness of a space-efficient equality-only string index before proceeding.

libvast/vast/hash_index.hpp

This index is meant for opaque identifiers that can only be retrieved via equality lookups. If a string type contains the #id attribute in a schema, the index factory chooses a new hash-based implementation. The hash index is a vector of digests, where each digest is trimmed to a specific byte length. This works only if the hash function distributes its values uniformly over all output bits. Whether our current choice of hash function (xxhash64) satisfies this requirement is an open question that needs to be figured out.
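
To make the "vector of trimmed digests" idea concrete, here is a minimal sketch. It is not the code in libvast/vast/hash_index.hpp: the name `toy_hash_index` is hypothetical, and `std::hash` stands in for xxhash64 so the example has no external dependencies. Only the `Bytes` template parameter name is borrowed from the commit list.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string_view>
#include <vector>

// Hypothetical sketch, not the libvast implementation. std::hash is an
// assumption that stands in for xxhash64.
template <std::size_t Bytes>
class toy_hash_index {
  static_assert(Bytes > 0 && Bytes <= 8, "digest must fit into 64 bits");

public:
  using digest = std::array<std::uint8_t, Bytes>;

  // Appends the trimmed digest of x. The position in the vector is the ID.
  void append(std::string_view x) {
    digests_.push_back(make_digest(x));
  }

  // Returns all IDs whose stored digest equals the digest of x, in ascending
  // order. Trimming can produce collisions, so the result may contain false
  // positives.
  std::vector<std::size_t> lookup_equal(std::string_view x) const {
    std::vector<std::size_t> ids;
    const auto needle = make_digest(x);
    for (std::size_t i = 0; i < digests_.size(); ++i)
      if (digests_[i] == needle)
        ids.push_back(i);
    return ids;
  }

  std::size_t size() const {
    return digests_.size();
  }

private:
  // Keeps only the lowest Bytes bytes of the 64-bit hash value.
  static digest make_digest(std::string_view x) {
    const std::uint64_t h = std::hash<std::string_view>{}(x);
    digest d{};
    for (std::size_t i = 0; i < Bytes; ++i)
      d[i] = static_cast<std::uint8_t>(h >> (8 * i));
    return d;
  }

  std::vector<digest> digests_;
};
```

With `toy_hash_index<5>`, every appended string costs 5 bytes regardless of its length. Because only digests are stored, the original strings cannot be recovered or ordered, which is why only equality-style lookups are possible. Fewer bytes per digest save space but raise the chance of collisions, which is where the uniformity requirement comes from.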

The value index keeps two extra bitmaps independent of its implementation: a null bitmap and a mask bitmap. The null bitmap covers all nil values, and the mask bitmap used to cover all IDs that exist in the index *and* the nil values. This change makes the two bitmaps disjoint. The result is not only fewer bits overall, but also fewer bit-wise operations when performing a lookup, because the null bitmap no longer needs to be masked out (AND NOT) of the result.
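
A small sketch of the saved operation, under assumptions: `std::bitset` stands in for the real bitmap type, and the function names are hypothetical. The actual lookup code in libvast/vast/value_index.hpp may be structured differently.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>

constexpr std::size_t num_ids = 8;
using bitmap = std::bitset<num_ids>;

// Overlapping layout (before): the mask covers every appended ID, nil or not,
// so nil positions must be removed from the raw result in a second step.
bitmap restrict_overlapping(bitmap raw, bitmap mask, bitmap null) {
  return raw & mask & ~null;
}

// Disjoint layout (after): the mask covers only non-nil IDs, so one AND
// suffices.
bitmap restrict_disjoint(bitmap raw, bitmap mask) {
  return raw & mask;
}
```

For example, with six appended IDs of which IDs 1 and 4 are nil, the overlapping mask has six bits set and the disjoint mask has four. Both functions return the same bitmap for any raw result as long as the disjoint mask equals the overlapping mask with the nil positions cleared.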

tobim

Review of the first 3 commits only, so I can do this in chunks.

Looking good so far.

libvast/src/value_index_factory.cpp

dominiklohmann

I've done a first pass. Conceptually, this seems all good to me, and I didn't notice anything weird in a short test run. Just a few small notes here and there for now.

libvast/src/format/zeek.cpp

libvast/src/view.cpp

libvast/vast/view.hpp

libvast/vast/hash_index.hpp

tobim

I changed my review strategy to go file by file.

libvast/src/format/zeek.cpp

libvast/src/value_index_factory.cpp

libvast/vast/value_index.hpp

libvast/src/value_index_factory.cpp

dominiklohmann · 2020-01-09T14:48:40Z

The commit ede727e seems to have triggered an ADL correctness issue, causing the wrong operator to be called and thus the wrong type name to be displayed. That should likely be fixed in a separate PR.

This only popped up because of an overload in the Zeek type printer that didn't get triggered because it lacked a const qualifier.

mavam · 2020-01-09T16:39:34Z

@dominiklohmann Sorry, I fixed the issue already before seeing your request to do this separately.

mavam requested a review from a team November 1, 2019 22:13

mavam added the enhancement ✨ and performance labels Nov 1, 2019

mavam force-pushed the story/ch4493 branch 3 times, most recently from ce68bfe to 88b7f60 Compare November 18, 2019 20:55

mavam added the feature label and removed the enhancement ✨ label Nov 18, 2019

mavam changed the title ~~Add optimized implementation of opaque string indexes~~ Add optimized implementation for string indexes Nov 18, 2019

mavam force-pushed the story/ch4493 branch from fe5979b to 55929bf Compare November 19, 2019 20:39

mavam commented Nov 19, 2019

libvast/vast/hash_index.hpp (outdated, resolved)

mavam force-pushed the story/ch4493 branch from bbcdf29 to 47d8b71 Compare December 19, 2019 16:33

mavam force-pushed the story/ch4493 branch 2 times, most recently from 05fdebe to 887eecb Compare January 7, 2020 16:23

mavam added 14 commits January 8, 2020 20:01

Make string_view hashable

b24b370

Enhance select range with fast-forward capability

8086698

Remove redundant attribute extractor helper

29c5fcb

Make data views hashable and generalize hash index

45b6e45

Fix style violation

63c31d6

Add changelog entry

16604f1

Fix hash index serialization

6e7f07c

Redo implementation with new approach

5357a40

Include nil values in inequality queries

e0e3382

Remove comparison function causing compiler issues

6a63db6

Prohibit append after deserialization

7c74f45

Rework unit tests with nil values

a5a4617

mavam and others added 4 commits January 8, 2020 20:02

Add function that adds #index=hash to Zeek fields

cc1b8e6

Add community_id as hash-index field

204f565

Do not print attributes in Zeek type header

6b96185

Support containers on RHS

3512f7d

mavam force-pushed the story/ch4493 branch from 887eecb to 3512f7d Compare January 8, 2020 19:03

Support equality comparison of data and views

30fa408

mavam force-pushed the story/ch4493 branch 2 times, most recently from a0dcc1c to 3ca898e Compare January 8, 2020 21:17

Avoid unnecessary view materialization

060a0fd

mavam force-pushed the story/ch4493 branch from 3ca898e to 060a0fd Compare January 9, 2020 08:42

tobim reviewed Jan 9, 2020

libvast/src/value_index_factory.cpp (resolved)

dominiklohmann requested changes Jan 9, 2020

tobim requested changes Jan 9, 2020

libvast/src/format/zeek.cpp (outdated, resolved)

libvast/src/value_index_factory.cpp (resolved)

libvast/vast/value_index.hpp (outdated, resolved)

tobim reviewed Jan 9, 2020

libvast/src/value_index_factory.cpp (resolved)

mavam added 4 commits January 9, 2020 14:26

Replace obsolete do-while loop with for loop

3a27618

Fixup comments

a239159

Remove unnecessary include

cafbead

Add missing const qualifier

ede727e

Fix integration test

8b59ec1

tobim previously approved these changes Jan 9, 2020

Improve documentation of Bytes template parameter

d8a2655

mavam dismissed tobim’s stale review via d8a2655 January 9, 2020 16:56

dominiklohmann approved these changes Jan 10, 2020

dominiklohmann merged commit 00e8abf into master Jan 10, 2020

dominiklohmann deleted the story/ch4493 branch January 10, 2020 09:42

tobim mentioned this pull request Jan 28, 2020

Generalize hash indexes to all data types #726

Merged
