[Feature Request] `terms` query's term lookup should be able to efficiently handle 100k+ (or 1M+) terms #12341
Comments
Of course, if we're concerned about sending massive queries from the client to the server, we can always use the "terms lookup" feature for a `terms` query. More broadly, I wonder if there are any other query types where people might send very large sets of data that we could encode more efficiently from the client to the coordinator. Vector search, maybe?
I added a clearer outcome to the description and assigned the issue to myself.
This would be a great enhancement (and differentiator). We, like others, have a use case which does not scale using `terms` or terms lookup, and joins don't perform well enough either. The ability to quickly test whether an integer record field value is included in a large list of integers (10K..10M in extreme cases) would be extremely valuable and avoid custom plugins. An approach using Roaring Bitmaps with client-side encoded query bitmaps would be even more flexible, since one could express arbitrary bitset operations as query constraints and the data would be encoded and transported more efficiently.
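To make the client-side encoding idea concrete, here is a minimal sketch using the org.roaringbitmap library. Everything here (the ID values, the notion of shipping the encoded string in a query body) is illustrative rather than an existing OpenSearch API:

```java
import java.nio.ByteBuffer;
import java.util.Base64;
import org.roaringbitmap.RoaringBitmap;

public class ClientSideBitmapExample {
    public static void main(String[] args) {
        // Client-side: collect the allowed integer IDs (could be 10K..10M values).
        RoaringBitmap allowedIds = new RoaringBitmap();
        for (int id = 0; id < 1_000_000; id += 3) {
            allowedIds.add(id);
        }
        allowedIds.runOptimize(); // switch to run-length containers where that compresses better

        // Serialize and base64-encode; this string would travel in the query body
        // instead of hundreds of thousands of individual terms.
        ByteBuffer buffer = ByteBuffer.allocate(allowedIds.serializedSizeInBytes());
        allowedIds.serialize(buffer);
        String encoded = Base64.getEncoder().encodeToString(buffer.array());
        System.out.println("encoded bitmap length in chars: " + encoded.length());
    }
}
```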
I'm starting to dive into this issue and taking guidance from @msfroh. |
Is your feature request related to a problem? Please describe
I recently read this blog post, where the author claims a 10x speedup on a large `terms` query by encoding the field values as a roaring bitmap. I believe that part of the improvement comes from the use of doc values to post-filter hits that come from a lead iterator, which OpenSearch now does by default (starting with 2.12) thanks to @harshavamsi's changes in #11209 to support IndexOrDocValuesQuery for all numeric query types. (Behind the scenes, Lucene implements the DV query using a `LongHashSet`, which I think should perform similarly to RoaringBitmap.)
The more interesting part (IMO) is that the roaring bitmap of numeric terms gets created on the client and sent as a base64-encoded bitset, where it's used as the doc value filter.
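For reference, here is a minimal sketch of how an IndexOrDocValuesQuery pairs a points-based set query with a doc-values set filter for a numeric field. It assumes Lucene 9.4+ (where SortedNumericDocValuesField.newSlowSetQuery is available), the field name is made up, and it is not the actual OpenSearch wiring from #11209:

```java
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.SortedNumericDocValuesField;
import org.apache.lucene.search.IndexOrDocValuesQuery;
import org.apache.lucene.search.Query;

public class NumericSetQueryExample {
    public static Query buildQuery(long[] ids) {
        // Index-structure side: walks the BKD tree for the "user_id" points field.
        Query pointsQuery = LongPoint.newSetQuery("user_id", ids);
        // Doc-values side: post-filters candidate docs against a hash set of values.
        Query dvQuery = SortedNumericDocValuesField.newSlowSetQuery("user_id", ids);
        // Lucene picks the cheaper side as the lead iterator and verifies with the other.
        return new IndexOrDocValuesQuery(pointsQuery, dvQuery);
    }
}
```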
Similarly, we have the terms lookup feature on the `terms` query, but it's doing a kind-of naive "fetch an array of strings" approach. My idea is to borrow the roaring bitmap idea from the linked blog post and add that to the `terms` query's lookup.
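For context, this is roughly what the existing lookup looks like from a client's point of view: the query names a document in another index, and the coordinator fetches the whole array of values at `path` and expands it into a plain terms query. The index, document, and field names below are made up for illustration, and the call uses the OpenSearch low-level REST client:

```java
import org.apache.http.HttpHost;
import org.opensearch.client.Request;
import org.opensearch.client.Response;
import org.opensearch.client.RestClient;

public class TermsLookupToday {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            Request search = new Request("GET", "/products/_search");
            // Every value in entitlements/user-123's owned_ids array is fetched and
            // inlined as an individual term -- fine for hundreds, painful for 100k+.
            search.setJsonEntity("""
                {
                  "query": {
                    "terms": {
                      "product_id": {
                        "index": "entitlements",
                        "id": "user-123",
                        "path": "owned_ids"
                      }
                    }
                  }
                }
                """);
            Response response = client.performRequest(search);
            System.out.println(response.getStatusLine());
        }
    }
}
```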
Describe the solution you'd like
I would like to modify the terms lookup feature to add a new (opt-in) "protocol" between the main index and the term lookup index. The term lookup index should assign consistent, increasing ordinals to the values in the terms field. When the main index queries the term lookup index, it should pass a bitset of the ordinals whose values it "knows" (cached from previous requests). After finding the matching document in the term lookup index, the response should carry a bitset of matching ordinals, along with the ordinal-to-value mapping for any unknown values. This should allow us to carry very large sets of IDs across the index boundary in a compact representation.
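A rough sketch of the request/response shapes such a protocol might exchange is below. All class and field names are hypothetical (nothing here exists in OpenSearch today), and RoaringBitmap is just one possible bitset encoding:

```java
import java.util.Map;
import org.roaringbitmap.RoaringBitmap;

/** Hypothetical request sent from the main index's coordinator to the term lookup index. */
class OrdinalLookupRequest {
    String lookupIndex;           // lookup index that assigns stable, increasing ordinals to term values
    String lookupId;              // document holding the term set (e.g. one user's entitlements)
    String path;                  // field within that document
    RoaringBitmap knownOrdinals;  // ordinals whose values the coordinator already has cached
}

/** Hypothetical response: matches come back as ordinals rather than repeated string values. */
class OrdinalLookupResponse {
    RoaringBitmap matchingOrdinals;        // the term set, expressed as ordinals
    Map<Integer, String> newOrdinalValues; // ordinal -> value, only for ordinals not in knownOrdinals
}
```

The effect would be that repeat queries only pay for ordinals the coordinator has not seen before, while the bitset itself stays compact even at 1M+ entries.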
As a next step, the term lookup should support multiple lookup keys and Boolean/bitset operations between them. The term lookup index will return the bitset after performing the Boolean operations. (The term lookup index may cache the result of these Boolean operations.)
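To illustrate the kind of bitset algebra the lookup index could evaluate before returning a single bitset, here is a sketch using the org.roaringbitmap library; the per-key bitmaps are made-up data:

```java
import org.roaringbitmap.RoaringBitmap;

public class LookupBitsetOps {
    public static void main(String[] args) {
        // Hypothetical per-key bitsets held by the term lookup index.
        RoaringBitmap store1   = RoaringBitmap.bitmapOf(1, 5, 9, 100_000);
        RoaringBitmap store2   = RoaringBitmap.bitmapOf(5, 9, 42, 250_000);
        RoaringBitmap recalled = RoaringBitmap.bitmapOf(9);

        // "Available at store 1 OR store 2, AND NOT recalled", evaluated inside the
        // lookup index so that only the final bitset crosses the index boundary.
        RoaringBitmap union  = RoaringBitmap.or(store1, store2);
        RoaringBitmap result = RoaringBitmap.andNot(union, recalled);
        System.out.println(result); // {1,5,42,100000,250000}
    }
}
```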
Related component
Search:Query Capabilities
Describe alternatives you've considered
I drafted (on my computer) a whole idea around a new query type that would work with numeric ID bitsets, either passed in the query or stored in a custom data store (probably implemented as an OpenSearch index, but could be somewhere else).
My concerns with that were:
Making the existing lookup feature of the `terms` query "smarter" feels like a lot less work for me and for users.
Additional context
As I mentioned above, I had drafted a proposal on my computer to build a whole dedicated API. While I no longer think that's the right move, my proposal did have some nice examples of possible use-cases:
Example 1 - Digital entitlements
An example would be a digital content entitlement system, with each document in the search index corresponding to a digital product. End-users can purchase access to content. When an end-user searches their library, they should only see content to which they have access. Updating each piece of content whenever a user makes a purchase is not practical, since a single item of content may be owned by many users. Instead, we would like each user to have an updatable collection of the content they own that can be applied as a filter on their searches.
Example 2 - Multi-location retail search
A retail grocery chain offers online ordering and delivery across many store locations. They have a single catalog of products, but each location may carry a different selection of products. Product selection at any store may fluctuate as inventory sells out and new inventory is delivered. When a user in a given city searches for products, they should see items currently available from their local stores. There should be an updatable collection for each store and a user should be able to search across the union of products available from their local stores.