Support Filtering on Large List encoded by Bitmap #14774

Merged
merged 35 commits into from
Aug 20, 2024

@bowenlan-amzn bowenlan-amzn commented Jul 16, 2024

Problem

To retrieve documents that match at least one item from a given list, we can use the terms query. We can even save the filter list in a document and use terms lookup to fetch it and feed it into the terms query.

However, as the filter grows, so do the memory and network transfer overheads. This overhead disproportionately hurts latency and TPS once the filter becomes huge (10k+ items), making the approach unusable.

Proposal

We can encode the filter with RoaringBitmap, which uses less memory and bandwidth and provides fast, deterministic in-memory random access (lookup).
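
To give a rough feel for the bandwidth claim, the sketch below compares the size of a JSON array of ids with an estimate of the Roaring portable serialization size. It assumes the layout without run containers (array containers up to 4096 values, 8 KiB bitmap containers above that), so a real serializer with run optimization may produce something smaller. The function names are illustrative and not part of this PR.

```python
import json


def roaring_size_estimate(ids):
    """Estimate the portable serialized size in bytes, ignoring run containers."""
    per_chunk = {}
    for value in set(ids):
        # The high 16 bits select the container; the low 16 bits live inside it.
        per_chunk[value >> 16] = per_chunk.get(value >> 16, 0) + 1
    size = 8  # cookie + container count
    for cardinality in per_chunk.values():
        size += 4  # key + (cardinality - 1)
        size += 4  # offset header entry
        # Array container: 2 bytes per value; bitmap container: fixed 8 KiB.
        size += 2 * cardinality if cardinality <= 4096 else 8192
    return size


def json_list_size(ids):
    """Size in bytes of the same ids sent as a compact JSON array of numbers."""
    return len(json.dumps(sorted(set(ids)), separators=(",", ":")))


dense_ids = range(1_000_000, 1_100_000)  # 100k consecutive product ids
print(json_list_size(dense_ids), roaring_size_estimate(dense_ids))
```

The gap is largest for dense id ranges; for very sparse ids spread over many 65536-wide chunks, the per-container header overhead narrows it.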

User Story

Users want to filter/join a main index with a bitmap filter on a numeric field.
For example, the index contains product ids and other product-related data. Each filter represents the products owned by a customer, expressed as a list of numeric product ids.

  • Users want to store the bitmap filter as documents in OpenSearch and join the filter with the main index at query time (the existing terms lookup feature).
    • Users want to do boolean operations across multiple saved bitmaps at query time using boolean queries.
    • (Optional) Users want to update the stored filters with bitmap operations such as add and remove.
  • Users want to run a terms query directly with a bitmap built on the client side, without having to save it.

An example experience using terms lookup

Users create a RoaringBitmap for a filter on the client side, serialize the bitmap to a byte array, and encode the byte array using base64.

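import base64
from pyroaring import BitMap  # assumed client library; the original snippet does not name it
bm = BitMap([111, 222, 333])  # product ids
encoded = base64.b64encode(bm.serialize())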

Users index/store the bitmap filter in a binary field (customer_filter) of an OpenSearch index (customers). The id of the document is the identifier of the customer associated with this filter.

# index mapping using a binary field with stored field enabled
# the field does not need to be kept in _source, so it is excluded there
PUT customers
{
    "mappings": {
        "_source": {
            "excludes": [
                "customer_filter"
            ]
        },
        "properties": {
            "customer_filter": {
                "type": "binary",
                "store": true
            }
        }
    }
}

POST customers/_doc/customer123
{
  "customer_filter": "OjAAAAEAAAAAAAIAEAAAAG8A3gBNAQ==" <-- base64-encoded serialized bitmap
}
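
To sanity-check an encoded filter before indexing it, the base64 string can be decoded with the standard library alone. The sketch below is a minimal reader for the Roaring portable format, assuming the layout without run containers (cookie 12346) and only array containers (at most 4096 values per 65536-wide chunk). The function name is illustrative and not part of this PR.

```python
import base64
import struct


def decode_roaring_ids(encoded):
    """Return the sorted ids stored in a base64-encoded portable Roaring bitmap."""
    data = base64.b64decode(encoded)
    cookie, container_count = struct.unpack_from("<II", data, 0)
    if cookie != 12346:  # cookie of the layout without run containers
        raise ValueError("only the no-run-container layout is handled here")
    # Descriptive header: (key, cardinality - 1) as two uint16 per container.
    headers = [struct.unpack_from("<HH", data, 8 + 4 * i) for i in range(container_count)]
    # The offset header (one uint32 per container) follows; container data comes after it.
    position = 8 + 8 * container_count
    ids = []
    for key, cardinality_minus_one in headers:
        cardinality = cardinality_minus_one + 1
        if cardinality > 4096:
            raise ValueError("bitmap containers are not handled in this sketch")
        low_bits = struct.unpack_from(f"<{cardinality}H", data, position)
        position += 2 * cardinality
        ids.extend((key << 16) | low for low in low_bits)
    return ids


# A bitmap holding the product ids 111, 222, and 333.
print(decode_roaring_ids("OjAAAAEAAAAAAAIAEAAAAG8A3gBNAQ=="))  # [111, 222, 333]
```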

Users run a terms lookup query on the products index (products), looking up the filter stored in the customers index for a given customer id.

POST products/_search
{
  "query": {
    "terms": {
      "product_id": {
        "index": "customers",
        "id": "customer123",
        "path": "customer_filter",
        "store": true  <-- new parameter to do the lookup on the stored field instead of _source
      },
      "value_type": "bitmap"  <-- new parameter in the terms query to specify the data type of the terms values
    }
  }
}

Users run a regular terms query and pass in the bitmap directly

POST products/_search
{
  "query": {
    "terms": {
      "product_id": "<customer_filter>",
      "value_type": "bitmap"
    }
  }
}
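
On the client side, the two request shapes above could be assembled as follows. This is a sketch that mirrors the request bodies proposed in this description ("store" inside the lookup object, "value_type" next to the field); the merged implementation may differ in detail, and the helper names are hypothetical.

```python
import json


def terms_lookup_bitmap_query(field, index, doc_id, path):
    """Terms lookup that reads a stored bitmap from a document in another index."""
    lookup = {"index": index, "id": doc_id, "path": path, "store": True}
    return {"query": {"terms": {field: lookup, "value_type": "bitmap"}}}


def terms_bitmap_query(field, encoded_bitmap):
    """Terms query that carries the base64-encoded bitmap in the request itself."""
    return {"query": {"terms": {field: encoded_bitmap, "value_type": "bitmap"}}}


body = terms_lookup_bitmap_query("product_id", "customers", "customer123", "customer_filter")
print(json.dumps(body, indent=2))  # body for POST products/_search
```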

Users run a boolean query that combines multiple filters with boolean operations

POST products/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "terms": {
            "product_id": "<customer1_filter>",
            "value_type": "bitmap"
          }
        }
      ],
      "should": [
        {
          "terms": {
            "product_id": "<customer2_filter>",
            "value_type": "bitmap"
          }
        },
        {
          "terms": {
            "product_id": "<customer3_filter>",
            "value_type": "bitmap"
          }
        }
      ],
      "must_not": [
        {
          "terms": {
            "product_id": "<customer4_filter>",
            "value_type": "bitmap"
          }
        }
      ]
    }
  }
}
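
The set semantics of the boolean query above can be modeled with plain Python sets. The sketch assumes the default bool query behavior: every must clause has to match, must_not clauses exclude, and should clauses are optional (they only influence scoring) whenever a must clause is present. The function and sample ids are illustrative only.

```python
def bool_filter(must=(), should=(), must_not=(), universe=frozenset()):
    """Return the ids matched by a bool query whose clauses are sets of ids."""
    if must:
        matched = set.intersection(*(set(ids) for ids in must))
    elif should:
        # Without must clauses, at least one should clause has to match.
        matched = set().union(*should)
    else:
        # Only must_not clauses: start from every document.
        matched = set(universe)
    for ids in must_not:
        matched -= set(ids)
    return matched


customer1, customer2, customer3, customer4 = {1, 2, 3, 4}, {2, 5}, {3, 6}, {4}
result = bool_filter(must=[customer1], should=[customer2, customer3], must_not=[customer4])
print(sorted(result))  # [1, 2, 3]
```

Ids 5 and 6 are not returned because the should clauses do not widen the must clause; they would only raise the score of products 2 and 3.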

Implementation

Functional Requirements

  • New bitmap queries that accept a bitmap and determine which documents match it. There are 2 types of queries, which can be wrapped with IndexOrDocValuesQuery.
    • A bitmap query (doc values type) that can efficiently verify whether a document matches
    • A bitmap query (index type) that can efficiently be used to produce a lead iterator
  • Terms lookup on a stored bitmap in a document
    • Previously, the terms lookup feature only fetched the _source of a document using a get-by-id request. This PR adds support for fetching from a stored field. This is needed for the binary field because its _source value would be the base64-encoded string.
  • Support bitmap as the field values in the terms query
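
The two query types can be illustrated with toy in-memory structures. In this sketch, `doc_values` (doc id to field value) and `postings` (field value to sorted doc ids) are hypothetical stand-ins for Lucene's doc values and index structures, the field is assumed single-valued, and the size comparison in `bitmap_filter` is a simplified stand-in for the cost-based choice that IndexOrDocValuesQuery makes.

```python
import heapq


def doc_values_matches(candidate_doc_ids, doc_values, bitmap):
    """Verify candidates one by one: look up each doc's value in the bitmap."""
    return [doc for doc in candidate_doc_ids if doc_values.get(doc) in bitmap]


def index_matches(postings, bitmap):
    """Lead iterator: merge the sorted postings of every value in the bitmap."""
    lists = [postings[value] for value in sorted(bitmap) if value in postings]
    return list(heapq.merge(*lists))


def bitmap_filter(postings, doc_values, bitmap, lead_candidates=None):
    """Pick a strategy; comparing sizes is a simplification for illustration."""
    if lead_candidates is None:
        return index_matches(postings, bitmap)
    if len(lead_candidates) <= len(bitmap):
        return doc_values_matches(lead_candidates, doc_values, bitmap)
    allowed = set(lead_candidates)
    return [doc for doc in index_matches(postings, bitmap) if doc in allowed]


doc_values = {0: 111, 1: 999, 2: 222, 3: 111, 4: 333}
postings = {}
for doc, value in sorted(doc_values.items()):
    postings.setdefault(value, []).append(doc)

bitmap = {111, 222, 333}
print(bitmap_filter(postings, doc_values, bitmap))             # [0, 2, 3, 4]
print(bitmap_filter(postings, doc_values, bitmap, [1, 2, 3]))  # [2, 3]
```

When another clause already narrows the candidates, verifying each candidate against the bitmap avoids walking the postings of every id in a large filter.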

Non-functional Requirements

  • When the filter is small, passing the filter values directly should still be better; when the filter becomes huge, encoding it and passing it as a bitmap should be better. Benchmark to recommend a general threshold to users.
  • Request payloads have a size limit, which bounds how large a bitmap users can query with. Run experiments to recommend the maximum number of values in a bitmap that users can query on.

Related Issues

Resolves #12341

Check List

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: bowenlan-amzn <bowenlan23@gmail.com>
@github-actions bot added the enhancement and Search:Query Capabilities labels on Jul 16, 2024

❌ Gradle check result for 63f1cd4: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@smacrakis

LGTM!

terms query delegate to bitmap query

Signed-off-by: bowenlan-amzn <bowenlan23@gmail.com>

❌ Gradle check result for deeb3ee: FAILURE

@bowenlan-amzn

7/19: I have a working draft of this bitmap filtering feature using terms lookup.
A passing YAML test has been added; it should give you an idea of how to use the feature.

Will continue after 7/28

❌ Gradle check result for 7b2ddb8: FAILURE

❌ Gradle check result for 577e6d0: FAILURE

❌ Gradle check result for 2e67647: FAILURE

❌ Gradle check result for 3cd735e: FAILURE

❕ Gradle check result for b9bf2d4: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

codecov bot commented Jul 30, 2024

Codecov Report

Attention: Patch coverage is 76.59574% with 33 lines in your changes missing coverage. Please review.

Project coverage is 71.86%. Comparing base (cbe7921) to head (9c1c039).
Report is 9 commits behind head on main.

Files                                                    Patch %   Lines
.../org/opensearch/index/query/TermsQueryBuilder.java    76.78%    2 Missing and 11 partials ⚠️
...org/opensearch/index/mapper/NumberFieldMapper.java    62.96%    8 Missing and 2 partials ⚠️
.../opensearch/search/query/BitmapDocValuesQuery.java    86.66%    2 Missing and 4 partials ⚠️
.../main/java/org/opensearch/indices/TermsLookup.java    69.23%    0 Missing and 4 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #14774      +/-   ##
============================================
+ Coverage     71.82%   71.86%   +0.04%     
- Complexity    63046    63104      +58     
============================================
  Files          5207     5208       +1     
  Lines        295581   295712     +131     
  Branches      42690    42723      +33     
============================================
+ Hits         212295   212525     +230     
+ Misses        65875    65682     -193     
- Partials      17411    17505      +94     

❕ Gradle check result for 389f469: UNSTABLE

  • TEST FAILURES:
      2 org.opensearch.backwards.MixedClusterClientYamlTestSuiteIT.test {p0=search/370_bitmap_filtering/Terms query accepting bitmap as value}
      1 org.opensearch.backwards.MixedClusterClientYamlTestSuiteIT.test {p0=search/370_bitmap_filtering/Terms query accepting bitmap as value}
      1 org.opensearch.backwards.MixedClusterClientYamlTestSuiteIT.test {p0=search/370_bitmap_filtering/Boolean should bitmap filtering}
      1 org.opensearch.backwards.MixedClusterClientYamlTestSuiteIT.test {p0=search/370_bitmap_filtering/Boolean must bitmap filtering}
      1 org.opensearch.backwards.MixedClusterClientYamlTestSuiteIT.test {p0=search/370_bitmap_filtering/Boolean must bitmap filtering}

github-actions bot commented Aug 8, 2024

❌ Gradle check result for c51dfc8: FAILURE

@msfroh msfroh left a comment

I went through the code pretty carefully and left three comments.

In my opinion, none of them are blockers for this PR.

Signed-off-by: Michael Froh <froh@amazon.com>
@msfroh msfroh requested a review from linuxpi as a code owner August 16, 2024 17:34

❌ Gradle check result for 9c1c039: FAILURE

❕ Gradle check result for 9c1c039: UNSTABLE

msfroh commented Aug 20, 2024

I don't seem to be able to retrigger the Mend Security Check.

It's not configured to block merges, though, so I believe it's safe to merge anyway.

@msfroh msfroh merged commit 52ecbe9 into opensearch-project:main Aug 20, 2024
34 of 35 checks passed
opensearch-trigger-bot bot pushed a commit that referenced this pull request Aug 20, 2024
---------

Signed-off-by: bowenlan-amzn <bowenlan23@gmail.com>
Signed-off-by: Michael Froh <froh@amazon.com>
Co-authored-by: Michael Froh <froh@amazon.com>
(cherry picked from commit 52ecbe9)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
msfroh added a commit that referenced this pull request Aug 21, 2024
)

* Support Filtering on Large List encoded by Bitmap (#14774)

---------

Signed-off-by: bowenlan-amzn <bowenlan23@gmail.com>
Signed-off-by: Michael Froh <froh@amazon.com>
Co-authored-by: Michael Froh <froh@amazon.com>
(cherry picked from commit 52ecbe9)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

* Update version checks to look for 2.17.0

Signed-off-by: Michael Froh <froh@amazon.com>

---------

Signed-off-by: bowenlan-amzn <bowenlan23@gmail.com>
Signed-off-by: Michael Froh <froh@amazon.com>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Michael Froh <froh@amazon.com>
wdongyu pushed a commit to wdongyu/OpenSearch that referenced this pull request Aug 22, 2024
@bowenlan-amzn bowenlan-amzn deleted the 12341-bitmap-filtering branch August 26, 2024 10:01
akolarkunnu pushed a commit to akolarkunnu/OpenSearch that referenced this pull request Sep 10, 2024
Labels
backport 2.x, enhancement, Search:Query Capabilities, v2.17.0
Development

Successfully merging this pull request may close these issues.

[Feature Request] terms query's term lookup should be able to efficiently handle 100k+ (or 1M+) terms
3 participants