Improve efficiency of BoundedBreakIteratorScanner fragmentation algorithm #73785

thelink2012 · 2021-06-04T19:12:24Z

As discussed in #73569, the current implementation is too slow in certain scenarios.

The inefficient part of the code can be stated as the following problem:

Given a text (getText()) and a position in this text (offset), find the sentence boundaries before and after the offset such that the after boundary is as far right as possible while still satisfying end boundary - start boundary < fragment size.

If no after boundary can satisfy that condition, use the nearest boundary following offset.

The current approach begins by finding the nearest preceding and following boundaries, then expands the following boundary greedily for as long as the problem restriction holds. This is fine asymptotically, but BreakIterator, which is used to find each boundary, is sometimes expensive [Lucene noticed too].
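
For illustration, the greedy strategy described above can be sketched against the JDK's java.text.BreakIterator. This is a minimal sketch, not the actual BoundedBreakIteratorScanner code: the class name GreedyFragmentSketch, the method fragment, and the parameter maxLen (standing in for the fragment size) are hypothetical, and the sketch assumes 0 <= offset < text.length().

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical sketch of the greedy strategy; not the Elasticsearch implementation.
final class GreedyFragmentSketch {

    /** Returns {start, end} for the fragment around offset (assumes 0 <= offset < text.length()). */
    static int[] fragment(String text, int offset, int maxLen) {
        BreakIterator bi = BreakIterator.getSentenceInstance(Locale.ROOT);
        bi.setText(text);

        // Nearest boundary at or before offset.
        int start = bi.preceding(offset + 1);
        if (start == BreakIterator.DONE) {
            start = 0;
        }

        // Nearest boundary strictly after offset; this is the fallback result.
        int end = bi.following(offset);
        if (end == BreakIterator.DONE) {
            end = text.length();
        }

        // Greedy expansion: one BreakIterator call per additional sentence.
        while (true) {
            int next = bi.following(end);
            if (next == BreakIterator.DONE || next - start >= maxLen) {
                break;
            }
            end = next;
        }
        return new int[] { start, end };
    }
}
```

The loop makes one following() call per sentence added to the fragment, so the number of BreakIterator calls grows with the number of sentences that fit within maxLen.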

The new approach maximizes the after boundary directly: it scans for the last boundary preceding the position at which the condition would be violated (i.e. knowing the start boundary and the offset, how many characters are left before the resulting length reaches the fragment size). If this scan finds the start boundary, the problem restriction cannot be satisfied, and we take the first boundary following offset instead (or better, since we have already scanned [offset, targetEndOffset], we start from targetEndOffset + 1).
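
The backward scan described above might look roughly like the following, again written against java.text.BreakIterator. This is a sketch under assumptions, not the code from this PR: the class name BoundedScanSketch, the method fragment, the parameter maxLen (standing in for the fragment size), and the local targetEnd (standing in for targetEndOffset) are hypothetical, and the sketch assumes 0 <= offset < text.length().

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical sketch of the backward-scan strategy; not the Elasticsearch implementation.
final class BoundedScanSketch {

    /** Returns {start, end} for the fragment around offset (assumes 0 <= offset < text.length()). */
    static int[] fragment(String text, int offset, int maxLen) {
        BreakIterator bi = BreakIterator.getSentenceInstance(Locale.ROOT);
        bi.setText(text);

        // Nearest boundary at or before offset.
        int start = bi.preceding(offset + 1);
        if (start == BreakIterator.DONE) {
            start = 0;
        }

        // Largest end position that still satisfies end - start < maxLen.
        int targetEnd = (int) Math.min((long) start + maxLen - 1, text.length());

        int end;
        if (targetEnd <= offset) {
            // The size limit does not reach past offset: take the next boundary.
            end = bi.following(offset);
        } else if (targetEnd == text.length()) {
            // The end of the text is always a boundary and is within the limit.
            end = text.length();
        } else {
            // Last boundary at or before targetEnd.
            end = bi.preceding(targetEnd + 1);
            if (end <= offset) {
                // Nothing usable in (offset, targetEnd]; resume after targetEnd.
                end = bi.following(targetEnd);
            }
        }
        if (end == BreakIterator.DONE) {
            end = text.length();
        }
        return new int[] { start, end };
    }
}
```

In this sketch the number of BreakIterator calls is at most three per fragment, regardless of how many sentences fit within maxLen.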

In theory, both approaches should produce exactly the same outputs given the same (text, offset, fragment_size) tuple. But they don't. As far as my investigation went, BreakIterator doesn't seem to be commutative: previous method calls affect later calls. So I guess any change we make to this algorithm may produce results that differ from the current ones. It is rare, but it happens. Would love to be wrong on this though.
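
One way to probe the suspicion above is to compare an iterator that has already answered earlier queries with a freshly created one. The helper below is only a sketch of such a check using java.text.BreakIterator; the class name OrderDependenceCheck and the method reusedMatchesFresh are hypothetical, and the iterator implementation used by the highlighter may behave differently from the JDK one. Every probe must lie in [0, text.length()].

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical check for call-order dependence; not part of the PR.
final class OrderDependenceCheck {

    /** True if a reused iterator answers every probe exactly like a fresh iterator does. */
    static boolean reusedMatchesFresh(String text, int[] probes) {
        BreakIterator reused = BreakIterator.getSentenceInstance(Locale.ROOT);
        reused.setText(text);
        for (int probe : probes) {
            // A fresh iterator has no history of earlier calls.
            BreakIterator fresh = BreakIterator.getSentenceInstance(Locale.ROOT);
            fresh.setText(text);
            if (reused.following(probe) != fresh.following(probe)) {
                return false;
            }
        }
        return true;
    }
}
```

A false result for some text and probe sequence would point at call-order dependence in the iterator itself rather than in the fragmentation algorithm.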

The highlighting-related unit and integration tests pass, though some tests that seem completely unrelated to this change fail for me. See below.

cla-checker-service · 2021-06-04T19:12:27Z

💚 CLA has been signed

thelink2012 · 2021-06-04T19:17:44Z

We're still working on the CLA thing. But back to the test issues, here's what I'm getting after running ./gradlew check -Dtests.haltonfailure=false . Seems to be related to distribution and documentation. Should I worry about it?

jimczi

I left one comment regarding an edge case but the logic looks good to me.

But back to the test issues, here's what I'm getting after running ./gradlew check -Dtests.haltonfailure=false . Seems to be related to distribution and documentation. Should I worry about it?

The integration tests for docs seem to fail. It could be a side effect of this change so the best is to look at the report that is linked after the error message:
Execution failed for task ':docs:integTest'.

We're still working on the CLA thing.

Thanks, don't hesitate to ping me on the PR when it's done.

In theory, both approaches should produce exactly the same outputs given the same (text, offset, fragment_size) tuple. But they don't

I am surprised too. I'll dig to see where that comes from.

server/src/main/java/org/apache/lucene/search/uhighlight/BoundedBreakIteratorScanner.java

thelink2012 · 2021-06-11T14:54:40Z

The integration tests for docs seem to fail. It could be a side effect of this change so the best is to look at the report that is linked after the error message:

This happens on master too, before the patches. I rebased onto 2c52127 and the issue still occurs. Reports can be found here. It is related to a timeout in DocsClientYamlTestSuiteIT:

java.lang.Exception: Suite timeout exceeded (>= 2400000 msec).
	at __randomizedtesting.SeedInfo.seed([7C03B7D75C3D1D4D]:0)

Haven't tried increasing the timeout since 40 minutes is a lot already.

jimczi · 2021-06-17T22:17:41Z

Sorry for the slow reply. We don't see these timeouts in our build, and I doubt that this change is responsible. In any case, we'll trigger all tests in our CI before merging, so if there's a problem we will see it.
Any news on the CLA front?

thelink2012 · 2021-06-21T18:23:41Z

No problem. Sorry for the delay on the CLA too. I'll ping you once that's signed :)

elasticmachine · 2021-06-25T03:56:09Z

Pinging @elastic/es-search (Team:Search)

thelink2012 · 2021-06-30T17:52:29Z

Hey @jimczi. We've signed the CLA. Could you check?

jimczi · 2021-07-01T09:12:39Z

@elasticmachine update branch

elasticmachine · 2021-07-01T09:12:40Z

user doesn't have permission to update head repository

jimczi

Thanks @thelink2012, the change looks good to me.
Can you merge master into your branch so that I can trigger the tests with the latest changes?

thelink2012 · 2021-07-01T17:22:19Z

Done.

jimczi · 2021-07-02T05:45:39Z

@elasticmachine ok to test

jimczi · 2021-07-02T06:36:37Z

@thelink2012 can you merge master again? Sorry for the back and forth, but the last merge was made on a broken state.

thelink2012 · 2021-07-02T14:32:42Z

No problem. Merged. Let's see how the tests go :)

jimczi · 2021-07-05T07:12:32Z

@elasticmachine ok to test

jimczi

LGTM, thanks @thelink2012!

…73785) The current approach begins by finding the nearest preceding and following boundaries, and expands the following boundary greedily while it respects the problem restriction. This is fine asymptotically, but BreakIterator which is used to find each boundary is sometimes expensive. The new approach maximizes the after boundary by scanning for the last boundary preceding the position that would cause the condition to be violated (i.e. knowing start boundary and offset, how many characters are left before resulting length is fragment size). If this scan finds the start boundary, it means it's impossible to satisfy the problem restriction, and we get the first boundary following offset instead (or better, since we already scanned [offset, targetEndOffset], start from targetEndOffset + 1).

…74898) The current approach begins by finding the nearest preceding and following boundaries, and expands the following boundary greedily while it respects the problem restriction. This is fine asymptotically, but BreakIterator which is used to find each boundary is sometimes expensive. The new approach maximizes the after boundary by scanning for the last boundary preceding the position that would cause the condition to be violated (i.e. knowing start boundary and offset, how many characters are left before resulting length is fragment size). If this scan finds the start boundary, it means it's impossible to satisfy the problem restriction, and we get the first boundary following offset instead (or better, since we already scanned [offset, targetEndOffset], start from targetEndOffset + 1). Co-authored-by: Denilson das Mercês Amorim <denimorim@gmail.com>

tylersmalley · 2021-07-13T13:55:47Z

@jimczi, looks like this change caused issues that were caught in Kibana testing (elastic/kibana#104466) - preventing us from using the nightly snapshots. I think it would be best to revert this PR to unblock downstream teams relying on an up-to-date Elasticsearch with Kibana. If you agree that is the correct approach, would you mind helping with a revert here? I am headed out on vacation right now, but will be back Monday and can assist with validating a new PR against Kibana going forward.

…73785)" This reverts commit 1fc09f9.

…73785) (#74898)" This reverts commit 9efc37e.

ywelsch · 2021-07-13T14:10:24Z

@jimczi is currently out on vacation. I've reverted the relevant commits on master (154105f) / 7.x (e97c4ce) branches to unblock downstream teams.

@thelink2012 can you open up another separate PR again for this where we can then iterate on the necessary fixes?

thelink2012 · 2021-07-13T14:39:05Z

Done, @ywelsch

elasticsearchmachine added the external-contributor label (Pull request authored by a developer outside the Elasticsearch team) Jun 4, 2021

jimczi reviewed Jun 9, 2021

server/src/main/java/org/apache/lucene/search/uhighlight/BoundedBreakIteratorScanner.java (outdated, resolved)

thelink2012 added 3 commits June 11, 2021 11:41

Improve performance of BoundedBreakIteratorScanner

147d5ab

Better compatibility between old and new algorithm

7f0cb56

Remove unreachable condition in uhighlight context

9c7a130

thelink2012 force-pushed the denilson/improved-highlight branch from 1eb9f50 to 9c7a130 Compare June 11, 2021 14:42

mark-vieira added the :Search/Search label (Search-related issues that do not fall into other categories) Jun 25, 2021

elasticmachine added the Team:Search label (Meta label for search team) Jun 25, 2021

jimczi added >enhancement v7.15.0 labels Jul 1, 2021

jimczi reviewed Jul 1, 2021

Merge branch 'master' into denilson/improved-highlight

3c4d755

thelink2012 requested a review from jimczi July 1, 2021 17:23

Merge branch 'master' into denilson/improved-highlight

b4eda90

jimczi approved these changes Jul 5, 2021

jimczi merged commit 1fc09f9 into elastic:master Jul 5, 2021

jimczi mentioned this pull request Jul 5, 2021

Improve BoundedBreakIteratorScanner fragmentation algorithm (#73785) #74898

Merged

ywelsch added a commit that referenced this pull request Jul 13, 2021

Revert "Improve BoundedBreakIteratorScanner fragmentation algorithm (#…

154105f

…73785)" This reverts commit 1fc09f9.

ywelsch added a commit that referenced this pull request Jul 13, 2021

Revert "Improve BoundedBreakIteratorScanner fragmentation algorithm (#…

e97c4ce

…73785) (#74898)" This reverts commit 9efc37e.

ywelsch removed the v7.15.0 label Jul 13, 2021

thelink2012 mentioned this pull request Jul 13, 2021

Improve efficiency of BoundedBreakIteratorScanner fragmentation algorithm (the 2nd) #75306

Closed

thelink2012 mentioned this pull request Jul 13, 2022

Failing ES Promotion: discover app discover tab field data search php should show the correct hit count elastic/kibana#104466

Closed

thelink2012 mentioned this pull request Aug 2, 2022

Improve efficiency of BoundedBreakIteratorScanner fragmentation algorithm #89041

Merged
