Python: filter elements with an optional filtering function #417
I ran some benchmarks on a 150k-vector dataset of 64 dimensions with inner product (I will have to repeat them on a public dataset in order to be able to share details). The general finding is that the introduced filter function gives competitive results vs. brute force only when a small fraction of items is excluded from the 150k-vector set. This is important information that users of the filter function must be aware of: it can be useful, but in some scenarios there is extreme performance degradation.
Hi @gtsoukas,
Thank you @yurymalkov, it would be great to have this functionality. Here are some benchmarks on a public dataset demonstrating that, for medium to large fractions of the search space, the filtering functionality clearly outperforms both brute-force search and a server-based solution that supports filtered ANN: https://github.com/gtsoukas/filtered-ann-benchmarks.
Thank you so much for the review @dyashuni! I don't know what to do about the last comment (line 613). Happy to dig deeper if you could give me a direction.
@gtsoukas Thank you for the updates!
…pdat tests (credits go to dyashuni)
@dyashuni, thanks a million times for the patch! I have applied the patch as-is. All my functional and performance tests look good after applying it. As someone not very familiar with C++, I find that the changes you made make the library easier to use. I don't have anything to add. It is ready to merge from my side. One thing that could perhaps be improved in the context of this PR, though it is not strictly related, is that the concepts of label, id and index could be sharpened. My understanding is that when using the Python bindings, we always use indices but never ids or labels. As a first step, this could be made clearer by not calling parameters "id" (e.g. line 877 of python_bindings/bindings.py). Ideally, there would be a way to handle actual labels/ids via the Python bindings. For me, the difference between an index on the one side and a label/id on the other is that the former is restricted to natural integers, ideally without gaps, whereas the latter can be of any type. Currently I maintain separate index/label mappings when using the Python version of hnswlib. This is, however, a tiny criticism of an awesome library!
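To illustrate the separate index/label mapping mentioned in the comment above, here is a minimal sketch of how it could be maintained on the caller's side. The class `LabelIndexMap` and its method names are hypothetical and are not part of hnswlib; the sketch only assumes that labels are hashable and that indices should be dense natural integers without gaps.

```python
from typing import Dict, Hashable, List


class LabelIndexMap:
    """Hypothetical helper: maps arbitrary hashable labels to dense integer indices."""

    def __init__(self) -> None:
        self._label_to_index: Dict[Hashable, int] = {}
        self._index_to_label: List[Hashable] = []

    def add(self, label: Hashable) -> int:
        """Return the index for `label`, assigning the next free index if it is new."""
        index = self._label_to_index.get(label)
        if index is None:
            index = len(self._index_to_label)  # next gap-free index
            self._label_to_index[label] = index
            self._index_to_label.append(label)
        return index

    def index_of(self, label: Hashable) -> int:
        """Look up the integer index of a known label (raises KeyError if unknown)."""
        return self._label_to_index[label]

    def label_of(self, index: int) -> Hashable:
        """Look up the original label for an integer index (raises IndexError if unknown)."""
        return self._index_to_label[index]
```

With such a mapping, the integers returned by `add` would be the values passed to the index, and `label_of` would translate query results back to the caller's own labels.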
@gtsoukas Thank you!
Thanks for explaining @dyashuni!
@gtsoukas Got it, thank you!
Summary of the change
Expose the filtering functionality introduced in #402 in the Python API. Changes are kept to a minimum: only the HNSW index is covered, not the brute-force index.
For a discussion of the interface design see here.
Preliminary performance characteristics for filtering (not strictly related to the changes):
Here filter denotes the fraction of elements the query was restricted to. none denotes that no filtering has been applied.
As per the above table, there is a threshold for the filter fraction below which exact (brute-force) search is probably preferable to filtered ANN.
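For reference, the exact baseline that filtered ANN is compared against can be sketched in a few lines of plain Python. This is not the benchmark code and does not use hnswlib; the function name `filtered_exact_search` and the `is_allowed` callback are assumptions made for illustration. It scores only the allowed elements by inner product and returns the top `k`, which is why it becomes cheap when the query is restricted to a small fraction of the elements.

```python
import heapq
from typing import Callable, List, Sequence, Tuple


def filtered_exact_search(
    vectors: Sequence[Sequence[float]],
    query: Sequence[float],
    k: int,
    is_allowed: Callable[[int], bool],
) -> List[Tuple[int, float]]:
    """Exact top-k search by inner product over the elements accepted by `is_allowed`.

    Returns (index, score) pairs sorted by descending inner product.
    """
    scored = (
        (sum(a * b for a, b in zip(vec, query)), idx)
        for idx, vec in enumerate(vectors)
        if is_allowed(idx)  # skipped elements are never scored
    )
    return [(idx, score) for score, idx in heapq.nlargest(k, scored)]
```

The cost grows with the number of allowed elements rather than with the size of the whole dataset, so a caller could fall back to this kind of scan when the filter is very restrictive. Where exactly that threshold lies depends on the data and should be measured.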