Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle degenerate case where all HNSW search candidates are filtered #11787

Open
msokolov opened this issue Sep 19, 2022 · 3 comments
Open

Handle degenerate case where all HNSW search candidates are filtered #11787

msokolov opened this issue Sep 19, 2022 · 3 comments
Labels

Comments

@msokolov
Copy link
Contributor

Description

This test failure reproduces every time. What seems to happen is that we search with a filter that retains > 50% of documents yet we hit an unlucky condition where the graph is not fully connected and every candidate node we visit gets filtered, so we end up with 0 results. It's kind of a degenerate case that is pretty unlikely to arise in a real graph, yet it seems we ought to have some kind of fallback to exact search for this case.

./gradlew :lucene:core:test --tests "org.apache.lucene.search.TestKnnVectorQuery.testFilterWithSameScore" -Ptests.jvms=1 -Ptests.jvmargs=-XX:TieredStopAtLevel=1 -Ptests.seed=C5E04AD69C13E006 -Ptests.gui=true -Ptests.file.encoding=ISO-8859-1

Version and environment details

No response

@msokolov
Copy link
Contributor Author

This test is really testing a pathological case ... when the vectors are all the same everything is equidistant from everything else and "nearest neighbor" ceases to really even mean anything. I'm not sure we should actually have this test other than to verify that there is no crash. Maybe I'm misunderstanding, but what it the test really asserting?

@jtibshirani
Copy link
Member

Thanks for digging into this! I added this test to exercise the tie-breaking logic. But now I think it wasn't a good idea -- HNSW is known to exhibit very poor performance when vectors are duplicated. And this test takes it to an extreme! It's not really a scenario we support well.

Maybe we could just remove this test. It wasn't critical, and I could always follow-up with a better way to test tie-breaking.

@msokolov
Copy link
Contributor Author

We could keep the test if we did this: #11790 which would cause fallback to a full scan in this kind of case. It seems like a reasonable fallback to me

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants