Prune expressions for the meta index lookup #1433

Merged
merged 4 commits into from
Mar 12, 2021

Conversation

tobim
Member

@tobim tobim commented Mar 11, 2021

📔 Description

We can take advantage of the fact that all string fields are covered by the same type-level synopsis. To be more specific: a query like `suricata.smb.host == "foo" || suricata.ssh.host == "foo"` would check twice whether the string "foo" is included in the same type-level string synopsis.

This PR adds a preprocessing step for meta index lookups that removes predicates with duplicate strings from the expression.
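
To illustrate the idea, here is a minimal sketch of such a pruning step. It is not the code from this PR: the `predicate` struct, the flat `disjunction` alias, and the `prune_for_meta_index` function are hypothetical simplifications, and it assumes that every predicate compares a string field, so all of them hit the same type-level synopsis. The real expression type in libvast is a tree, not a flat list.

```cpp
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified predicate of the form `field op "value"`.
// Assumption: every field is a string field.
struct predicate {
  std::string field;
  std::string op;
  std::string value;
};

// Hypothetical flat disjunction: p1 || p2 || ...
using disjunction = std::vector<predicate>;

// Keeps only the first predicate for each distinct (op, value) pair.
// Because all string fields share one type-level synopsis, a second
// predicate with the same operator and string would repeat the same
// synopsis lookup and cannot change the set of candidate partitions.
disjunction prune_for_meta_index(const disjunction& expr) {
  std::set<std::pair<std::string, std::string>> seen;
  disjunction result;
  for (const auto& pred : expr) {
    // emplace reports whether this (op, value) pair was newly inserted.
    if (seen.emplace(pred.op, pred.value).second)
      result.push_back(pred);
  }
  return result;
}
```

The pruned expression is only meant for selecting candidate partitions; under this sketch's assumptions, the original, unpruned expression would still have to be evaluated against the events in those partitions.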

In my local comparison on 12723 partitions of eve.log data, I get a roughly 4x speedup for the query `vast export null 'net.domain == "alhgeoafh" || net.hostname == "alhgeoafh"'`.
A release build on master finishes the command in ~930 ms; this branch is done after ~240 ms.

📝 Checklist

  • All user-facing changes have changelog entries.
  • The PR description contains instructions for the reviewer, if necessary.

🎯 Review Instructions

Try to reproduce my results.

@tobim tobim added the performance Improvements or regressions of performance label Mar 11, 2021
@tobim tobim requested a review from a team March 11, 2021 11:49
@dominiklohmann dominiklohmann self-assigned this Mar 11, 2021
@dominiklohmann
Member

I'll review this one in practice; I still have a 1 TB database with Zeek streaming JSON on my hard drive.

Member

@dominiklohmann dominiklohmann left a comment


I verified the performance bump: I've seen a reduction from 2s to 0.4s on a similar query. There seems to be no visible performance penalty for regular queries.

I can reproduce the ASan failures in vast-test locally. Please fix and add a changelog entry, other than that LGTM.

@tobim tobim force-pushed the story/ch23175/meta-index-redundancy branch from 750f8aa to 890c6a7 Compare March 11, 2021 15:57
@tobim
Member Author

tobim commented Mar 11, 2021

I'm not sure whether I should add a changelog entry for this.

@mavam
Member

mavam commented Mar 11, 2021

@tobim if the performance gains are substantial, it's a nice change item. As a user, I always enjoy reading gains if they are specific and to the point (while broad claims are rather a turnoff).

@tobim tobim force-pushed the story/ch23175/meta-index-redundancy branch from 890c6a7 to 4c52e02 Compare March 11, 2021 17:19
Review thread on CHANGELOG.md (outdated)
Review thread on libvast/src/system/meta_index.cpp
Review thread on libvast/src/system/meta_index.cpp
Member

@dominiklohmann dominiklohmann left a comment


Fine by me, but please make the changelog entry either a Feature or a Change instead of introducing a new category.

We can take advantage of the fact that all string fields are covered
by the same type-level synopsis. To be more specific: a query like
`suricata.smb.host == "foo" || suricata.ssh.host == "foo"` would
check twice whether the string "foo" is included in the same
type-level string synopsis.

This commit adds a preprocessing step for meta index lookups that
removes predicates with duplicate strings from the expression.
@tobim tobim force-pushed the story/ch23175/meta-index-redundancy branch from cf66ce4 to a9e7364 Compare March 12, 2021 13:19
@tobim tobim merged commit 8b1a634 into master Mar 12, 2021
@tobim tobim deleted the story/ch23175/meta-index-redundancy branch March 12, 2021 14:29
Labels
performance Improvements or regressions of performance

3 participants