
Optimize binary search call #13595

Open · wants to merge 3 commits into main
Conversation

dungba88 (Contributor)

Description

I think advance is usually not called backward, so we can run the binary search from the current position + 1 instead of from 0. This also caps the fromIndex at scoreDocs.length to avoid running past the end of the array.

From the Javadoc of advance

   * <p>The behavior of this method is <b>undefined</b> when called with <code> target &le; current
   * </code>, or after the iterator has exhausted. Both cases may result in unpredicted behavior.
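
A minimal sketch of the idea, with hypothetical class and field names (the actual patch touches the scoreDocs-backed iterator), assuming the iterator wraps doc IDs sorted in ascending order:

```java
import java.util.Arrays;

// Illustrative sketch, not the actual patch: since advance() never goes
// backward, the binary search can start at the current position + 1 instead
// of 0, and fromIndex is capped at the array length.
class SortedDocIdCursor {
  private final int[] docs; // doc IDs from scoreDocs, sorted ascending
  private int idx = -1;     // current position, -1 before the first doc

  SortedDocIdCursor(int[] sortedDocs) {
    this.docs = sortedDocs;
  }

  int advance(int target) {
    // Search only the unvisited tail [idx + 1, docs.length) instead of [0, docs.length).
    int from = Math.min(idx + 1, docs.length); // cap to avoid running past the end
    int found = Arrays.binarySearch(docs, from, docs.length, target);
    idx = found >= 0 ? found : -1 - found; // decode insertion point when not found
    return idx < docs.length ? docs[idx] : Integer.MAX_VALUE; // i.e. NO_MORE_DOCS
  }
}
```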

kaivalnp (Contributor) left a comment

Nice catch! Do we have suitable benchmarks to measure this (ideally a combination of lexical + vector search)?

jpountz (Contributor) commented Jul 31, 2024

This makes sense to me. Maybe we should go one step further and perform an exponential search instead, e.g. by reusing IntArrayDocIdSetIterator.
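
For illustration, exponential (galloping) search over a sorted int[] could look roughly like the sketch below; this is not the actual IntArrayDocIdSetIterator code, just the general shape of the technique:

```java
import java.util.Arrays;

// Illustrative sketch of exponential (a.k.a. galloping) search over a sorted
// int[] of doc IDs; not the code in IntArrayDocIdSetIterator.
final class ExponentialSearch {

  /**
   * Returns the index of the first element in docs[from, to) that is >= target,
   * or `to` if there is none. Requires 0 <= from <= to <= docs.length and docs
   * sorted ascending. Cost is O(log d) where d is the distance from `from` to
   * the result, instead of O(log (to - from)) for a plain binary search.
   */
  static int firstGreaterOrEqual(int[] docs, int from, int to, int target) {
    int bound = 1;
    // Gallop: double the step until we find an element >= target or run off the end.
    while (from + bound < to && docs[from + bound] < target) {
      bound <<= 1;
    }
    // The answer now lies in [from + bound/2, min(from + bound, to)].
    int lo = from + (bound >> 1);
    int hi = Math.min(from + bound, to);
    int found = Arrays.binarySearch(docs, lo, hi, target);
    return found >= 0 ? found : -1 - found; // decode insertion point when not found
  }
}
```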

dungba88 (Contributor, Author) commented Aug 1, 2024

The advance will keep reducing the array size, and we will generally advance by small steps ahead, right? Then I think exponential search makes sense. I'll try to use IntArrayDocIdSetIterator in the next revision.

gsmiller (Contributor) commented Aug 1, 2024

I think exponential search will only outperform binary search in this case if we expect the next target to be relatively close to the "min" we're constantly "pushing up" (thanks to your change). Is that the case? (Specifically, I think the math works out that exponential search is only better if the target is in the next sqrt(N) elements where N is the size of the remaining list being searched... but I could be wrong).
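
For reference, a rough sketch of the math behind that estimate (a standard comparison-count argument, not taken from the patch): a plain binary search over the remaining $N$ entries costs about $\log_2 N$ comparisons, while exponential search on a target $d$ positions ahead costs about $\log_2 d$ for the galloping phase plus another $\log_2 d$ for the bounded binary search, i.e. roughly $2 \log_2 d$ in total. Exponential search wins when

$$2 \log_2 d < \log_2 N \iff d < \sqrt{N},$$

which is where the $\sqrt{N}$ threshold comes from.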

jpountz (Contributor) commented Aug 1, 2024

If DocIdSetIterator#advance gets called on large increments, then there are only so many calls that can be done because the doc ID space is quickly exhausted. However, if you only advance by small intervals, then you could end up with millions of calls to DocIdSetIterator#advance, so this tends to be the case worth optimizing. E.g. the recent change to skip data removed the higher levels of skip data that help advance by large increments, yet queries observed a speedup rather than a slowdown.

gsmiller (Contributor) commented Aug 1, 2024

Ah yeah, OK thanks @jpountz. Makes sense.

dungba88 (Contributor, Author) commented Aug 2, 2024

@jpountz I was reading IntArrayDocIdSetIterator; it is a private class only exposed through IntArrayDocIdSet. I think we would need to extend its capability here (storing the score, i.e. both score and doc ID instead of just the doc ID). I think we can just re-implement the exponential search here. I can cut a follow-up to move the implementation to a common place, maybe ArrayUtil. WDYT?

github-actions bot commented

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

github-actions bot added the Stale label on Aug 17, 2024

dungba88 (Contributor, Author) commented

@jpountz I have changed the code to exponential search and moved the functionality to ArrayUtil. We still need two different methods for int[] and generic arrays, as Java doesn't allow generics over primitive types. Can you give some feedback on the new revision?
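
For context, the kind of split that implies (hypothetical ArrayUtil-style signatures, not necessarily the ones in the revision): because Java generics cannot range over primitives like int, the int[] variant needs its own overload next to a Comparator-based one for object arrays such as ScoreDoc[].

```java
import java.util.Comparator;

// Hypothetical signatures only (bodies omitted): Java generics cannot be
// instantiated with primitives, so int[] gets a dedicated overload beside
// the generic, Comparator-based variant for object arrays.
public final class ExponentialSearchSketch {

  /** First index in a[fromIndex, toIndex) whose value is >= key, for a sorted int[]. */
  public static int exponentialSearch(int[] a, int fromIndex, int toIndex, int key) {
    throw new UnsupportedOperationException("sketch only");
  }

  /** Same contract for a sorted T[], ordering defined by the comparator. */
  public static <T> int exponentialSearch(
      T[] a, int fromIndex, int toIndex, T key, Comparator<? super T> comparator) {
    throw new UnsupportedOperationException("sketch only");
  }
}
```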

github-actions bot removed the Stale label on Dec 16, 2024