Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Python: filter elements with an optional filtering function #417

Merged
merged 4 commits into from
Nov 9, 2022

Conversation

gtsoukas
Copy link
Contributor

@gtsoukas gtsoukas commented Sep 18, 2022

Summary of the change

Expose filtering functionality introduced in #402 in Python API. Changes are kept to a minimum, only HNSW index is implemented and not brute force.

For a discussion of the interface design see here.

Preliminary performance characteristics for filtering (not strictly related to the changes):

filter queries per second
0.1 4241.43
0.2 5993.89
0.3 8232.54
0.4 10688.06
0.5 11677.54
0.6 13242.56
0.7 14520.50
0.8 14659.25
0.9 14932.67
none 503578.34

Here filter denotes the fraction of elements the query was restricted to. none denotes that no filtering has been applied.

As per the above table, there is a threshold below which exact ANN is probably preferable.

@gtsoukas
Copy link
Contributor Author

I did some benchmarks with a 150k vector dataset of 64 dimensions with inner product.(will have to repeat on a public dataset in order to be able to share details). The general finding is, that the introduced filter function only gives competitive results vs. brute force, when only a small fraction of items are excluded from the 150k vector set. So this is an important info that users of the filter function must be aware of, it can be useful, but in some scenarios there is extreme performance degradation.

@yurymalkov
Copy link
Member

Hi @gtsoukas,
Thank you for the PR! We will review it and test it shortly.

@gtsoukas
Copy link
Contributor Author

Thank you @yurymalkov ,

it would be great to have this functionality. Here are some benchmarks demonstrating on a public dataset that the filtering functionality for medium to large fractions of the search space clearly outperforms brute force search and a server based solution that supports filtered ANN. https://github.com/gtsoukas/filtered-ann-benchmarks.

python_bindings/bindings.cpp Outdated Show resolved Hide resolved
python_bindings/bindings.cpp Outdated Show resolved Hide resolved
python_bindings/bindings.cpp Outdated Show resolved Hide resolved
python_bindings/bindings.cpp Outdated Show resolved Hide resolved
python_bindings/bindings.cpp Outdated Show resolved Hide resolved
python_bindings/bindings.cpp Outdated Show resolved Hide resolved
python_bindings/bindings.cpp Outdated Show resolved Hide resolved
python_bindings/bindings.cpp Outdated Show resolved Hide resolved
@gtsoukas
Copy link
Contributor Author

Thank you so much for the review @dyashuni!

I don't know what to do about the last comment (line 613). Happy to dig deeper if you could give me a direction.

@dyashuni
Copy link
Contributor

@gtsoukas Thank you for the updates!
Let me discuss with @yurymalkov about the code changes. There are possible options here.

@dyashuni
Copy link
Contributor

@gtsoukas Thank you!
I made several changes above yours: removed template filter_func_t, added filter to brute force index and updated tests.
I attached the patch file.
Could you please apply it and update the pull request?
git apply patch.txt
patch.txt

Comments are welcome!

@gtsoukas
Copy link
Contributor Author

gtsoukas commented Nov 2, 2022

@dyashuni, thanks a million times for the patch! I have applied the patch as-is.

All my functional and performance tests look good after application of the patch.

For someone not very familiar with C++, I find that the changes you made make the library easier to use.

I don't have to add anything. It is ready to merge from my side.

One thing that maybe could be improved in the context of this PR but not strictly related, is that the concepts of label, id and index could be sharpened: My understanding is, that when using the Python bindings, we always use indices but never ids or labels. As a first step this could be made more clear by not calling parameters "id" (e.g. line 877 of python_bindings/bindings.py). Ideally, there would be a way to handle actual labels/ids via the Python bindings. For me, the difference between an index on the one side and the label/id on the other side is that the first is restricted to natural integers, ideally without gaps whereas the latter can be of any type. Currently I am maintaining separate index/label mappings, when using the python version of hnswlib. This is however a tiny criticism for an awesome library!

@dyashuni
Copy link
Contributor

dyashuni commented Nov 5, 2022

@gtsoukas Thank you!
I’m not sure that I get everything correctly.
There is inconsistency in naming. In hnswalg.h we have "labels". In python_bindings we have "ids". Actually they mean the same.
In python when we add items (add_items function) we can also specify corresponding labels/ids via ids parameter. If the ids parameter is not specified then items get sequential labels/ids (0, 1, 2, …). The labels/ids are used to retrieve the items in get_items function.

@dyashuni dyashuni merged commit 983cea9 into nmslib:develop Nov 9, 2022
@gtsoukas
Copy link
Contributor Author

gtsoukas commented Nov 9, 2022

In python when we add items (add_items function) we can also specify corresponding labels/ids via ids parameter. If the ids parameter is not specified then items get sequential labels/ids (0, 1, 2, …). The labels/ids are used to retrieve the items in get_items function.

Thanks for explaining @dyashuni !
I was using the id parameter of the Python add_item method but with non-integer ids and received an error (ValueError: invalid literal for int() with base 10: ' 58725ab=>'). From this I probably jumped to the wrong conclusion that hnswlib (python version) does not handle ids whereas it actually handles ids as long as they are integers.
Best forget my comment, as it is unrelated to this pull request anyway and I will write a new issue once I have somethin more concrete.

@dyashuni
Copy link
Contributor

dyashuni commented Nov 9, 2022

@gtsoukas Got it, thank you!
Yes, we now support only integer ids. If you would like to use non-integer ids in your applications you can create mapping between non-integer ids to integer ids, and use the latter in hnswlib. It is not very convenient of course.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants