Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature Request] terms query's term lookup should be able to efficiently handle 100k+ (or 1M+) terms #12341

Closed
msfroh opened this issue Feb 15, 2024 · 5 comments · Fixed by #14774
Assignees
Labels
enhancement Enhancement or improvement to existing feature or request Search:Query Capabilities v2.17.0

Comments

@msfroh
Copy link
Collaborator

msfroh commented Feb 15, 2024

Is your feature request related to a problem? Please describe

I recently read this blog post, where the author claims a 10x speedup on a large terms query by encoding the field values as a roaring bitmap.

I believe that part of the improvement comes from the use of doc values to post-filter hits that come from a lead iterator, which is now the OpenSearch (starting with 2.12) thanks to @harshavamsi's changes in #11209 to support IndexOrDocValuesQuery for all numeric query types. (Behind the scenes, Lucene implements the DV query using a LongHashSet, which I think should perform similarly to RoaringBitmap.)

The more interesting part (IMO) is that the roaring bitmap of numeric terms gets created on the client and sent as a base64-encoded bitset, where it's used as the doc value filter.

Similarly, we have the terms lookup feature on the terms query, but it's doing a kind-of naive "fetch an array of strings" approach.

My idea is to borrow the roaring bitmap idea from the linked blog post and add that to terms query's lookup.

Describe the solution you'd like

I would like to modify the terms lookup feature to add a new (opt-in) "protocol" between the main index and the term lookup index. The term lookup index should assign consistent, increasing ordinals to the values in the terms field. When the main index queries the term lookup index, it should pass a bitset of the ordinals whose values it "knows" (cached from previous requests). After finding the matching document in the term lookup index, the response should carry a bitset of matching ordinals, along with the ordinal-to-value mapping for any unknown values. This should allow us to carry very large sets of IDs across the index boundary in a compact representation.

As a next step, the term lookup should support multiple lookup keys and Boolean/bitset operations between them. The term lookup index will return the bitset after performing the Boolean operations. (The term lookup index may cache the result of these Boolean operations.)

Related component

Search:Query Capabilities

Describe alternatives you've considered

I drafted (on my computer) a whole idea around a new query type that would work with numeric ID bitsets, either passed in the query or stored in a custom data store (probably implemented as an OpenSearch index, but could be somewhere else).

My concerns with that were:

  1. It forces folks to use numeric IDs, which may not be a viable option.
  2. It would have added lots of new APIs (to organize, create, and manage bitsets), as well as a whole new query type.

Making the existing lookup feature of terms query "smarter" feels like a lot less work for me and for users.

Additional context

As I mentioned to above, I had drafted a proposal on my computer to build a whole dedicated API. While I no longer think that's the right move, my proposal did have some nice examples of possible use-cases:

Example 1 - Digital entitlements

An example would be a digital content entitlement system, with each document in the search index corresponding to a digital product. End-users can purchase access to content. When an end-user searches their library, they should only see content to which they have access. Updating each piece of content whenever a user makes a purchase is not practical, since a single item of content may be owned by many users. Instead, we would like each user

Example 2 - Multi-location retail search

A retail grocery chain offers online ordering and delivery across many store locations. They have a single catalog of products, but each location may carry a different selection of products. Product selection at any store may fluctuate as inventory sells out and new inventory is delivered. When a user in a given city searches for products, they should see items currently available from their local stores. There should be an updatable collection for each store and a user should be able to search across the union of products available from their local stores.

@msfroh msfroh added enhancement Enhancement or improvement to existing feature or request untriaged labels Feb 15, 2024
@msfroh
Copy link
Collaborator Author

msfroh commented Feb 15, 2024

Of course, if we're concerned about sending massive queries from the client to the server, we can always use the "terms lookup" feature for a terms query.

More broadly, I wonder if there are any other query types where people might send very large sets of data that we could encode more efficiently from the client to the coordinator. Vector search, maybe?

@peternied
Copy link
Member

[Triage - attendees 1 2 3 4 5]
@msfroh Thanks for creating this issue; however, it isn't being accepted due to not having a clear outcome - seems more like an RFC would be a clearer starting point. Please feel free to open a new issue after addressing the reason.

@github-project-automation github-project-automation bot moved this from 🆕 New to ✅ Done in Search Project Board Feb 21, 2024
@msfroh msfroh self-assigned this May 20, 2024
@msfroh
Copy link
Collaborator Author

msfroh commented May 20, 2024

I added a clearer outcome to the description and assigned the issue to myself.

@msfroh msfroh reopened this May 20, 2024
@github-project-automation github-project-automation bot moved this from ✅ Done to 🏗 In progress in Search Project Board May 20, 2024
@msfroh msfroh changed the title [Feature Request] Support more efficient encoding of filters over many terms [Feature Request] terms query's term lookup should be able to efficiently handle 100k+ terms May 20, 2024
@msfroh msfroh changed the title [Feature Request] terms query's term lookup should be able to efficiently handle 100k+ terms [Feature Request] terms query's term lookup should be able to efficiently handle 100k+ (or 1M+) terms May 28, 2024
@github-project-automation github-project-automation bot moved this to Planned work items in OpenSearch Roadmap May 31, 2024
@ochrist-eis
Copy link

ochrist-eis commented Jun 7, 2024

This would be a great enhancement (and differentiator). We, like others, have a use case which does not scale using terms or terms lookup, and joins don't perform well enough either. The ability to quickly test whether an integer record field value is included in a large list of integers (10K..10M in extreme cases) would be extremely valuable and avoid custom plugins. An approach using Roaring Bitmaps with client-side encoded query bitmaps would be even more flexible since one could express arbitrary bitset operations as query constraints and the data would be encoded and transported more efficiently.

@bowenlan-amzn
Copy link
Member

I'm starting to dive into this issue and taking guidance from @msfroh.

@bowenlan-amzn bowenlan-amzn self-assigned this Jun 12, 2024
@harshavamsi harshavamsi moved this from Todo to Now (This Quarter) in Performance Roadmap Jun 17, 2024
@getsaurabh02 getsaurabh02 moved this from Now (This Quarter) to In Progress in Performance Roadmap Jun 24, 2024
@mch2 mch2 added the v2.17.0 label Jul 22, 2024
@mch2 mch2 moved this from In Progress to In-Review in Performance Roadmap Aug 5, 2024
@github-project-automation github-project-automation bot moved this from In-Review to Done in Performance Roadmap Aug 20, 2024
@github-project-automation github-project-automation bot moved this from 🏗 In progress to ✅ Done in Search Project Board Aug 20, 2024
@github-project-automation github-project-automation bot moved this to 2.17 (First RC 09/03, Release 09/17) in OpenSearch Project Roadmap Aug 30, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Enhancement or improvement to existing feature or request Search:Query Capabilities v2.17.0
Projects
Status: 2.17 (First RC 09/03, Release 09/17)
Status: New
Status: Done
Status: Done
Development

Successfully merging a pull request may close this issue.

6 participants