Python: filter elements with an optional filtering function #417

gtsoukas · 2022-09-18T18:29:48Z

Summary of the change

Expose filtering functionality introduced in #402 in Python API. Changes are kept to a minimum, only HNSW index is implemented and not brute force.

For a discussion of the interface design see here.

Preliminary performance characteristics for filtering (not strictly related to the changes):

filter	queries per second
0.1	4241.43
0.2	5993.89
0.3	8232.54
0.4	10688.06
0.5	11677.54
0.6	13242.56
0.7	14520.50
0.8	14659.25
0.9	14932.67
none	503578.34

Here filter denotes the fraction of elements the query was restricted to. none denotes that no filtering has been applied.

As per the above table, there is a threshold below which exact ANN is probably preferable.

gtsoukas · 2022-09-22T14:32:43Z

I did some benchmarks with a 150k vector dataset of 64 dimensions with inner product.(will have to repeat on a public dataset in order to be able to share details). The general finding is, that the introduced filter function only gives competitive results vs. brute force, when only a small fraction of items are excluded from the 150k vector set. So this is an important info that users of the filter function must be aware of, it can be useful, but in some scenarios there is extreme performance degradation.

yurymalkov · 2022-09-23T19:45:20Z

Hi @gtsoukas,
Thank you for the PR! We will review it and test it shortly.

gtsoukas · 2022-10-14T15:59:07Z

Thank you @yurymalkov ,

it would be great to have this functionality. Here are some benchmarks demonstrating on a public dataset that the filtering functionality for medium to large fractions of the search space clearly outperforms brute force search and a server based solution that supports filtered ANN. https://github.com/gtsoukas/filtered-ann-benchmarks.

python_bindings/bindings.cpp

gtsoukas · 2022-10-16T21:13:04Z

Thank you so much for the review @dyashuni!

I don't know what to do about the last comment (line 613). Happy to dig deeper if you could give me a direction.

dyashuni · 2022-10-24T06:54:56Z

@gtsoukas Thank you for the updates!
Let me discuss with @yurymalkov about the code changes. There are possible options here.

dyashuni · 2022-10-30T15:42:35Z

@gtsoukas Thank you!
I made several changes above yours: removed template filter_func_t, added filter to brute force index and updated tests.
I attached the patch file.
Could you please apply it and update the pull request?
git apply patch.txt
patch.txt

Comments are welcome!

…pdat tests (credits go to dyashuni)

gtsoukas · 2022-11-02T10:43:09Z

@dyashuni, thanks a million times for the patch! I have applied the patch as-is.

All my functional and performance tests look good after application of the patch.

For someone not very familiar with C++, I find that the changes you made make the library easier to use.

I don't have to add anything. It is ready to merge from my side.

One thing that maybe could be improved in the context of this PR but not strictly related, is that the concepts of label, id and index could be sharpened: My understanding is, that when using the Python bindings, we always use indices but never ids or labels. As a first step this could be made more clear by not calling parameters "id" (e.g. line 877 of python_bindings/bindings.py). Ideally, there would be a way to handle actual labels/ids via the Python bindings. For me, the difference between an index on the one side and the label/id on the other side is that the first is restricted to natural integers, ideally without gaps whereas the latter can be of any type. Currently I am maintaining separate index/label mappings, when using the python version of hnswlib. This is however a tiny criticism for an awesome library!

dyashuni · 2022-11-05T08:22:29Z

@gtsoukas Thank you!
I’m not sure that I get everything correctly.
There is inconsistency in naming. In hnswalg.h we have "labels". In python_bindings we have "ids". Actually they mean the same.
In python when we add items (add_items function) we can also specify corresponding labels/ids via ids parameter. If the ids parameter is not specified then items get sequential labels/ids (0, 1, 2, …). The labels/ids are used to retrieve the items in get_items function.

gtsoukas · 2022-11-09T10:10:28Z

In python when we add items (add_items function) we can also specify corresponding labels/ids via ids parameter. If the ids parameter is not specified then items get sequential labels/ids (0, 1, 2, …). The labels/ids are used to retrieve the items in get_items function.

Thanks for explaining @dyashuni !
I was using the id parameter of the Python add_item method but with non-integer ids and received an error (ValueError: invalid literal for int() with base 10: ' 58725ab=>'). From this I probably jumped to the wrong conclusion that hnswlib (python version) does not handle ids whereas it actually handles ids as long as they are integers.
Best forget my comment, as it is unrelated to this pull request anyway and I will write a new issue once I have somethin more concrete.

dyashuni · 2022-11-09T18:42:16Z

@gtsoukas Got it, thank you!
Yes, we now support only integer ids. If you would like to use non-integer ids in your applications you can create mapping between non-integer ids to integer ids, and use the latter in hnswlib. It is not very convenient of course.

gtsoukas added 2 commits September 18, 2022 17:52

Add Python filter option for knn query.

1f53470

Merge develop and resolve merge conflicts

dc80ab9

dyashuni reviewed Oct 16, 2022

View reviewed changes

Implemnt review suggestions

cebf89d

jeffchuber mentioned this pull request Oct 27, 2022

end to end working chroma-core/chroma#29

Merged

3 tasks

Removed template filter_func_t, add filter to brute force index and u…

f03ff8a

…pdat tests (credits go to dyashuni)

dyashuni merged commit 983cea9 into nmslib:develop Nov 9, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Python: filter elements with an optional filtering function #417

Python: filter elements with an optional filtering function #417

gtsoukas commented Sep 18, 2022 •

edited

Loading

gtsoukas commented Sep 22, 2022

yurymalkov commented Sep 23, 2022

gtsoukas commented Oct 14, 2022

gtsoukas commented Oct 16, 2022

dyashuni commented Oct 24, 2022

dyashuni commented Oct 30, 2022

gtsoukas commented Nov 2, 2022 •

edited

Loading

dyashuni commented Nov 5, 2022

gtsoukas commented Nov 9, 2022

dyashuni commented Nov 9, 2022

Python: filter elements with an optional filtering function #417

Python: filter elements with an optional filtering function #417

Conversation

gtsoukas commented Sep 18, 2022 • edited Loading

Summary of the change

Preliminary performance characteristics for filtering (not strictly related to the changes):

gtsoukas commented Sep 22, 2022

yurymalkov commented Sep 23, 2022

gtsoukas commented Oct 14, 2022

gtsoukas commented Oct 16, 2022

dyashuni commented Oct 24, 2022

dyashuni commented Oct 30, 2022

gtsoukas commented Nov 2, 2022 • edited Loading

dyashuni commented Nov 5, 2022

gtsoukas commented Nov 9, 2022

dyashuni commented Nov 9, 2022

gtsoukas commented Sep 18, 2022 •

edited

Loading

gtsoukas commented Nov 2, 2022 •

edited

Loading