-
Notifications
You must be signed in to change notification settings - Fork 80
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add approximate filters #607
Add approximate filters #607
Conversation
@@ -22,7 +22,7 @@ def parse_string_parameter(key: str, params: dict, default: str = None) -> str: | |||
|
|||
def parse_int_parameter(key: str, params: dict, default: int = None) -> int: | |||
if key not in params: | |||
if default: | |||
if default is not None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change is technically not related to this PR, but the previous if
statement throws a ConfigurationError if default == 0
. All unit tests pass in both configurations.
@@ -1254,6 +1254,17 @@ def _get_field_value(content, field_name): | |||
return _get_field_value(content["_source"], field_name) | |||
return None | |||
|
|||
def binary_search_for_last_negative_1(neighbors): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd rather use a Python utility here but I couldn't find one in the standard library. Any suggestions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Im not aware of anything. Also, given that the list may not be sorted, it might be harder to do. That being said, I think this is good.
@@ -1270,7 +1281,20 @@ def calculate_topk_search_recall(predictions, neighbors, top_k): | |||
self.logger.info("No neighbors are provided for recall calculation") | |||
return 0.0 | |||
min_num_of_results = min(top_k, len(neighbors)) | |||
last_neighbor_is_negative_1 = int(neighbors[min_num_of_results-1]) == -1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This logic is necessary when the dataset is first generated and then neighbors are filtered out. This occurs in some of the perf tools datasets like here. These datasets first generate true neighbors for the test vector, such as [5, 2, 3]
, and then post-filter the neighbors with attributes that do not pass the filter. The documents without the attributes are replaced by -1
. However looking more closely at the dataset generation code I'm not sure that all of the -1
entries are at the end of the array. @martin-gaievski I see you wrote the filtering code, could you please clarify? Thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
they should be at the end of collection of nearest neighbors for each query, although I'm not clearly remember sorting them. Can you share example where -1 are in the row, but they are mixed with some other ids?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't see any in testing, but wanted to ask since I couldn't find a sort. I took another look at the code for filtering neighbors and it looks like the initial array with all -1
entries is overwritten from left to right, so any remaining -1
ids are at the end. I don't think we'll use these filter datasets anyways (since Heemin's nested fields script generates k
true neighbors after filtering instead of generating k
true neighbors and then post-filtering to have <k
neighbors), but I'll call out the assumption that all -1
are at the end in a comment.
"script_score": { | ||
"query": {"bool": {"filter": filter_body}}, | ||
"script": { | ||
"source": "knn_score", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why use knn script score here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added it in OSB since it's mentioned in the filters documentation but I haven't written a workload yet since Vijay said it wasn't a priority. Is Knn script score something that we would like to benchmark?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I spoke with @VijayanB and it seems like @navneet1v is adding a feature in 2.17 where exact search can be performed via a knn query (as opposed to the script score query). This change should be much easier to benchmark than exact search via a script_score
query since it would just require a new workload without any OSB runner changes. I'll remove this script score code in the next revision.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@finnroblin but we will still need script score based benchmarks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jmazanec15, @finnroblin do we not want to add script score based benchmarks?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We probably shoud have this as well. But I was thinking that exact search should use new method for implementing it
truth_set = neighbors[:min_num_of_results] | ||
if last_neighbor_is_negative_1: | ||
self.logger.info("Last neighbor is -1") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove this logging, if you want please make it debug
@finnroblin Can you rebase this PR? |
e3821d9
to
9ea9f8c
Compare
Signed-off-by: Finn Roblin <finnrobl@amazon.com>
9ea9f8c
to
b56713b
Compare
Description
This change adds support for ingesting non-vector fields from an HDF5 dataset into OpenSearch k-NN. It also adds support for writing post-filtering queries. These changes aid the transition from k-NN's perf tool to using OSB for release benchmarking.
Please see OSB Workloads PR 364 for usage examples.
Testing
Unit tests were added that cover the new param and runner methods. Additionally all of the functionality has been tested via the parameter files in OSB Workloads PR 364.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.