[Feature Request] `terms` query's term lookup should be able to efficiently handle 100k+ (or 1M+) terms #12341
Comments
Of course, if we're concerned about sending massive queries from the client to the server, we can always use the "terms lookup" feature for a `terms` query. More broadly, I wonder if there are any other query types where people might send very large sets of data that we could encode more efficiently from the client to the coordinator. Vector search, maybe?
I added a clearer outcome to the description and assigned the issue to myself.
This would be a great enhancement (and differentiator). We, like others, have a use case which does not scale using `terms` or terms lookup, and joins don't perform well enough either. The ability to quickly test whether an integer record field value is included in a large list of integers (10K..10M in extreme cases) would be extremely valuable and avoid custom plugins. An approach using Roaring Bitmaps with client-side encoded query bitmaps would be even more flexible, since one could express arbitrary bitset operations as query constraints and the data would be encoded and transported more efficiently.
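To make the client-side encoding idea concrete, here is a minimal sketch using the org.roaringbitmap library. Everything here (the ID values, the notion of shipping the encoded string in a query body) is illustrative rather than an existing OpenSearch API:

```java
import java.nio.ByteBuffer;
import java.util.Base64;
import org.roaringbitmap.RoaringBitmap;

public class ClientSideBitmapExample {
    public static void main(String[] args) {
        // Client-side: collect the allowed integer IDs (could be 10K..10M values).
        RoaringBitmap allowedIds = new RoaringBitmap();
        for (int id = 0; id < 1_000_000; id += 3) {
            allowedIds.add(id);
        }
        allowedIds.runOptimize(); // switch to run-length containers where that compresses better

        // Serialize and base64-encode; this string would travel in the query body
        // instead of hundreds of thousands of individual terms.
        ByteBuffer buffer = ByteBuffer.allocate(allowedIds.serializedSizeInBytes());
        allowedIds.serialize(buffer);
        String encoded = Base64.getEncoder().encodeToString(buffer.array());
        System.out.println("encoded bitmap length in chars: " + encoded.length());
    }
}
```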
I'm starting to dive into this issue and taking guidance from @msfroh. |
Is your feature request related to a problem? Please describe
I recently read this blog post, where the author claims a 10x speedup on a large `terms` query by encoding the field values as a roaring bitmap. I believe that part of the improvement comes from the use of doc values to post-filter hits that come from a lead iterator, which OpenSearch now does by default (starting with 2.12) thanks to @harshavamsi's changes in #11209 to support IndexOrDocValuesQuery for all numeric query types. (Behind the scenes, Lucene implements the DV query using a `LongHashSet`, which I think should perform similarly to RoaringBitmap.)
The more interesting part (IMO) is that the roaring bitmap of numeric terms gets created on the client and sent as a base64-encoded bitset, where it's used as the doc value filter.
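For reference, here is a minimal sketch of how an IndexOrDocValuesQuery pairs a points-based set query with a doc-values set filter for a numeric field. It assumes Lucene 9.4+ (where SortedNumericDocValuesField.newSlowSetQuery is available), the field name is made up, and it is not the actual OpenSearch wiring from #11209:

```java
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.SortedNumericDocValuesField;
import org.apache.lucene.search.IndexOrDocValuesQuery;
import org.apache.lucene.search.Query;

public class NumericSetQueryExample {
    public static Query buildQuery(long[] ids) {
        // Index-structure side: walks the BKD tree for the "user_id" points field.
        Query pointsQuery = LongPoint.newSetQuery("user_id", ids);
        // Doc-values side: post-filters candidate docs against a hash set of values.
        Query dvQuery = SortedNumericDocValuesField.newSlowSetQuery("user_id", ids);
        // Lucene picks the cheaper side as the lead iterator and verifies with the other.
        return new IndexOrDocValuesQuery(pointsQuery, dvQuery);
    }
}
```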
Similarly, we have the terms lookup feature on the `terms` query, but it's doing a kind-of naive "fetch an array of strings" approach. My idea is to borrow the roaring bitmap idea from the linked blog post and add that to the `terms` query's lookup.
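For context, this is roughly what the existing lookup looks like from a client's point of view: the query names a document in another index, and the coordinator fetches the whole array of values at `path` and expands it into a plain terms query. The index, document, and field names below are made up for illustration, and the call uses the OpenSearch low-level REST client:

```java
import org.apache.http.HttpHost;
import org.opensearch.client.Request;
import org.opensearch.client.Response;
import org.opensearch.client.RestClient;

public class TermsLookupToday {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            Request search = new Request("GET", "/products/_search");
            // Every value in entitlements/user-123's owned_ids array is fetched and
            // inlined as an individual term -- fine for hundreds, painful for 100k+.
            search.setJsonEntity("""
                {
                  "query": {
                    "terms": {
                      "product_id": {
                        "index": "entitlements",
                        "id": "user-123",
                        "path": "owned_ids"
                      }
                    }
                  }
                }
                """);
            Response response = client.performRequest(search);
            System.out.println(response.getStatusLine());
        }
    }
}
```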
Describe the solution you'd like
I would like to modify the terms lookup feature to add a new (opt-in) "protocol" between the main index and the term lookup index. The term lookup index should assign consistent, increasing ordinals to the values in the terms field. When the main index queries the term lookup index, it should pass a bitset of the ordinals whose values it "knows" (cached from previous requests). After finding the matching document in the term lookup index, the response should carry a bitset of matching ordinals, along with the ordinal-to-value mapping for any unknown values. This should allow us to carry very large sets of IDs across the index boundary in a compact representation.
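A rough sketch of the request/response shapes such a protocol might exchange is below. All class and field names are hypothetical (nothing here exists in OpenSearch today), and RoaringBitmap is just one possible bitset encoding:

```java
import java.util.Map;
import org.roaringbitmap.RoaringBitmap;

/** Hypothetical request sent from the main index's coordinator to the term lookup index. */
class OrdinalLookupRequest {
    String lookupIndex;           // lookup index that assigns stable, increasing ordinals to term values
    String lookupId;              // document holding the term set (e.g. one user's entitlements)
    String path;                  // field within that document
    RoaringBitmap knownOrdinals;  // ordinals whose values the coordinator already has cached
}

/** Hypothetical response: matches come back as ordinals rather than repeated string values. */
class OrdinalLookupResponse {
    RoaringBitmap matchingOrdinals;        // the term set, expressed as ordinals
    Map<Integer, String> newOrdinalValues; // ordinal -> value, only for ordinals not in knownOrdinals
}
```

The effect would be that repeat queries only pay for ordinals the coordinator has not seen before, while the bitset itself stays compact even at 1M+ entries.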
As a next step, the term lookup should support multiple lookup keys and Boolean/bitset operations between them. The term lookup index will return the bitset after performing the Boolean operations. (The term lookup index may cache the result of these Boolean operations.)
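To illustrate the kind of bitset algebra the lookup index could evaluate before returning a single bitset, here is a sketch using the org.roaringbitmap library; the per-key bitmaps are made-up data:

```java
import org.roaringbitmap.RoaringBitmap;

public class LookupBitsetOps {
    public static void main(String[] args) {
        // Hypothetical per-key bitsets held by the term lookup index.
        RoaringBitmap store1   = RoaringBitmap.bitmapOf(1, 5, 9, 100_000);
        RoaringBitmap store2   = RoaringBitmap.bitmapOf(5, 9, 42, 250_000);
        RoaringBitmap recalled = RoaringBitmap.bitmapOf(9);

        // "Available at store 1 OR store 2, AND NOT recalled", evaluated inside the
        // lookup index so that only the final bitset crosses the index boundary.
        RoaringBitmap union  = RoaringBitmap.or(store1, store2);
        RoaringBitmap result = RoaringBitmap.andNot(union, recalled);
        System.out.println(result); // {1,5,42,100000,250000}
    }
}
```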
Related component
Search:Query Capabilities
Describe alternatives you've considered
I drafted (on my computer) a whole idea around a new query type that would work with numeric ID bitsets, either passed in the query or stored in a custom data store (probably implemented as an OpenSearch index, but could be somewhere else).
My concerns with that were:
Making the existing lookup feature of the `terms` query "smarter" feels like a lot less work for me and for users.
Additional context
As I mentioned above, I had drafted a proposal on my computer to build a whole dedicated API. While I no longer think that's the right move, my proposal did have some nice examples of possible use-cases:
Example 1 - Digital entitlements
An example would be a digital content entitlement system, with each document in the search index corresponding to a digital product. End-users can purchase access to content. When an end-user searches their library, they should only see content to which they have access. Updating each piece of content whenever a user makes a purchase is not practical, since a single item of content may be owned by many users. Instead, we would like each user to have an updatable collection of the content they own that can be applied as a filter on their searches.
Example 2 - Multi-location retail search
A retail grocery chain offers online ordering and delivery across many store locations. They have a single catalog of products, but each location may carry a different selection of products. Product selection at any store may fluctuate as inventory sells out and new inventory is delivered. When a user in a given city searches for products, they should see items currently available from their local stores. There should be an updatable collection for each store and a user should be able to search across the union of products available from their local stores.