Improve efficiency of BoundedBreakIteratorScanner fragmentation algorithm #73785
Conversation
💚 CLA has been signed
I left one comment regarding an edge case but the logic looks good to me.
But back to the test issues, here's what I'm getting after running `./gradlew check -Dtests.haltonfailure=false`. Seems to be related to distribution and documentation. Should I worry about it?
The integration tests for docs seem to fail. It could be a side effect of this change, so the best thing is to look at the report that is linked after the error message:
Execution failed for task ':docs:integTest'.
We're still working on the CLA thing.
Thanks, don't hesitate to ping me on the PR when it's done.
In theory, both approaches should produce exactly the same outputs given the same `(text, offset, fragment_size)` tuple. But they don't.
I am surprised too. I'll dig to see where that comes from.
Force-pushed from 1eb9f50 to 9c7a130
Happens on master too, before the patches. I rebased to 2c52127 and the issue still occurs. Reports can be found here. It is related to a timeout. Haven't tried increasing the timeout since 40 minutes is a lot already.
Sorry for the slow reply. We don't see these timeouts in our build but I doubt that this change is responsible. In any case we'll trigger all tests in our CI before merging so if there's a problem we will see it.
No problem. Sorry for the delay on the CLA too. I'll ping you once that's signed :)
Pinging @elastic/es-search (Team:Search)
Hey @jimczi. We've signed the CLA. Could you check?
@elasticmachine update branch
user doesn't have permission to update head repository
Thanks @thelink2012, the change looks good to me.
Can you merge master into your branch so that I can trigger the tests with the latest changes?
Done.
@elasticmachine ok to test
@thelink2012 can you merge master again? Sorry for the back and forth but the last merge was made in a broken state.
No problem. Merged. Let's see how the tests go :)
@elasticmachine ok to test
LGTM, thanks @thelink2012 !
…73785) The current approach begins by finding the nearest preceding and following boundaries, and expands the following boundary greedily while it respects the problem restriction. This is fine asymptotically, but the BreakIterator used to find each boundary is sometimes expensive. The new approach maximizes the after boundary by scanning for the last boundary preceding the position that would cause the condition to be violated (i.e. knowing the start boundary and offset, how many characters are left before the resulting length reaches the fragment size). If this scan finds the start boundary, it means it's impossible to satisfy the problem restriction, and we take the first boundary following offset instead (or better, since we already scanned [offset, targetEndOffset], start from targetEndOffset + 1).
…74898) Same change as above, backported. Co-authored-by: Denilson das Mercês Amorim <denimorim@gmail.com>
@jimczi, looks like this change caused issues that were caught in Kibana testing (elastic/kibana#104466) - preventing us from using the nightly snapshots. I think it would be best to revert this PR to unblock downstream teams relying on an up-to-date Elasticsearch with Kibana. If you agree that is the correct approach, would you mind helping with a revert here? I am headed out on vacation right now, but will be back Monday and can assist with validating a new PR against Kibana going forward.
@jimczi is currently out on vacation. I've reverted the relevant commits on master (154105f) / 7.x (e97c4ce) branches to unblock downstream teams. @thelink2012 can you open up another separate PR again for this where we can then iterate on the necessary fixes?
Done, @ywelsch |
As discussed in #73569, the current implementation is too slow in certain scenarios.
The inefficient part of the code can be stated as the following problem: given a text, an offset into it, and a fragment size, find a pair of boundaries enclosing the offset whose span does not exceed the fragment size (whenever possible).
The current approach begins by finding the nearest preceding and following boundaries, and expands the following boundary greedily while it respects the problem restriction. This is fine asymptotically, but `BreakIterator`, which is used to find each boundary, is sometimes expensive (Lucene noticed this too).

The new approach maximizes the after boundary by scanning for the last boundary preceding the position that would cause the condition to be violated (i.e. knowing the start boundary and `offset`, we know how many characters are left before the resulting length reaches `fragment_size`). If this scan finds the start boundary, it means it's impossible to satisfy the problem restriction, and we take the first boundary following `offset` instead (or better, since we already scanned `[offset, targetEndOffset]`, we start from `targetEndOffset + 1`).
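To make the two strategies concrete, here is a minimal sketch against `java.text.BreakIterator`. The method names, the `textLength` parameter, and the simplified fallback handling are assumptions made for illustration; the real logic, including edge cases this sketch skips, lives in `BoundedBreakIteratorScanner`:

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical sketch of the two strategies. Assumes 0 <= offset < textLength;
// the real BoundedBreakIteratorScanner handles more edge cases than this.
public class BoundarySketch {

    // Old strategy: find the nearest boundaries around offset, then greedily
    // expand the following boundary. Every expansion step is another
    // (potentially expensive) BreakIterator call.
    static int[] oldStrategy(BreakIterator bi, int offset, int fragmentSize, int textLength) {
        int before = bi.preceding(offset + 1);
        if (before == BreakIterator.DONE) {
            before = 0;
        }
        int after = bi.following(offset);
        if (after == BreakIterator.DONE) {
            after = textLength;
        }
        while (true) {
            int next = bi.following(after);
            if (next == BreakIterator.DONE || next - before > fragmentSize) {
                break; // expanding further would violate the fragment size
            }
            after = next;
        }
        return new int[] { before, after };
    }

    // New strategy: jump straight to the last position that can still satisfy
    // the restriction (targetEndOffset) and take the last boundary at or
    // before it, falling back to the first boundary past it.
    static int[] newStrategy(BreakIterator bi, int offset, int fragmentSize, int textLength) {
        int before = bi.preceding(offset + 1);
        if (before == BreakIterator.DONE) {
            before = 0;
        }
        int targetEndOffset = Math.min(before + fragmentSize, textLength);
        if (targetEndOffset >= textLength) {
            return new int[] { before, textLength }; // the whole tail fits
        }
        int after = bi.preceding(targetEndOffset + 1);
        if (after == BreakIterator.DONE || after <= offset) {
            // No usable boundary in [offset, targetEndOffset]: the restriction
            // can't be satisfied, so take the first boundary past that region.
            after = bi.following(targetEndOffset);
            if (after == BreakIterator.DONE) {
                after = textLength;
            }
        }
        return new int[] { before, after };
    }

    public static void main(String[] args) {
        String text = "One sentence. Another sentence here. And a third one.";
        BreakIterator bi = BreakIterator.getSentenceInstance(Locale.ROOT);
        bi.setText(text);
        int[] a = oldStrategy(bi, 20, 30, text.length());
        int[] b = newStrategy(bi, 20, 30, text.length());
        // Both print "Another sentence here. " (boundaries [14, 37))
        System.out.println(text.substring(a[0], a[1]));
        System.out.println(text.substring(b[0], b[1]));
    }
}
```

The point of the new strategy is that it replaces the greedy loop of `following` calls with, in the common case, a single `preceding` call.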
In theory, both approaches should produce exactly the same outputs given the same `(text, offset, fragment_size)` tuple. But they don't. As far as my investigation went, `BreakIterator` doesn't seem to be commutative: previous method calls affect later calls. So whatever change we make to this algorithm may produce results that differ from the current ones. It is rare, but it happens. Would love to be wrong on this, though (a small probe of that statefulness follows at the end of this description).

Highlighting-related unit and integration tests pass, though I'm failing tests that seem completely unrelated to this change. See below.
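Here is the probe mentioned above, a hypothetical illustration that `java.text.BreakIterator` is stateful: `following` and `preceding` move an internal cursor as a side effect, so cursor-relative calls like `next()` are order-sensitive. This isn't necessarily the source of the mismatch, just evidence that the iterator isn't stateless:

```java
import java.text.BreakIterator;
import java.util.Locale;

// Probe showing BreakIterator's mutable cursor: following() repositions the
// iterator, so a later next() depends on which calls ran before it.
public class CursorProbe {
    public static void main(String[] args) {
        String text = "First sentence. Second one. Third one.";
        BreakIterator bi = BreakIterator.getSentenceInstance(Locale.ROOT);
        bi.setText(text);

        bi.first();                       // cursor at the start of the text
        System.out.println(bi.next());    // 16: end of the first sentence

        bi.setText(text);                 // reset, then take a different path
        bi.following(20);                 // side effect: cursor moves to 28
        System.out.println(bi.next());    // 38: a different "next" boundary
    }
}
```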