Improve efficiency of BoundedBreakIteratorScanner fragmentation algorithm #89041

thelink2012 · 2022-08-02T15:33:25Z

This is PR #75306 (a fix for PR #73785), reopened from a user repository so that "Allow edits by maintainers" can be enabled. The description of the original PR follows. See those other PRs for previous discussions.

As discussed in #73569 the current implementation is too slow in certain scenarios.

The inefficient part of the code can be stated as the following problem:

Given a text (getText()) and a position in this text (offset), find the sentence boundaries before and after the offset such that the after boundary is maximal while still respecting end boundary - start boundary < fragment size.

In case it's impossible to produce an after boundary that respects the said condition, use the nearest boundary following offset.

The current approach begins by finding the nearest preceding and following boundaries, then expands the following boundary greedily for as long as it respects the problem restriction. This is fine asymptotically, but BreakIterator, which is used to find each boundary, is sometimes expensive (Lucene noticed this too).
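A minimal sketch of that greedy expansion, using the JDK's `java.text.BreakIterator` directly. The class name `GreedyFragmentSketch` and the method `fragment` are hypothetical names chosen for illustration, and this is not the actual `BoundedBreakIteratorScanner` code. It assumes `maxLen >= 1` and `0 <= offset <= text.length()`.

```java
import java.text.BreakIterator;
import java.util.Arrays;
import java.util.Locale;

// Hypothetical illustration of the greedy approach; not the Elasticsearch implementation.
final class GreedyFragmentSketch {

    /** Returns {start, end} of the fragment around offset. Assumes maxLen >= 1. */
    static int[] fragment(String text, int offset, int maxLen) {
        BreakIterator bi = BreakIterator.getSentenceInstance(Locale.ROOT);
        bi.setText(text);

        // Nearest boundary before the offset (DONE is -1, so clamp to the text start).
        int start = Math.max(bi.preceding(offset), 0);

        // Nearest boundary after the offset; this is also the fallback result.
        int end = bi.following(offset);
        if (end == BreakIterator.DONE) {
            end = text.length();
        }

        // Greedy expansion: one BreakIterator call per additional sentence,
        // stopping at the first boundary that violates end - start < maxLen.
        for (int next = bi.next();
             next != BreakIterator.DONE && (long) next - start < maxLen;
             next = bi.next()) {
            end = next;
        }
        return new int[] { start, end };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(fragment("One. Two. Three. Four.", 6, 13)));
    }
}
```

The loop makes the cost visible: a large fragment size over many short sentences means many `next()` calls for every fragment.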

The new approach maximizes the after boundary by scanning for the last boundary preceding the position that would cause the condition to be violated (i.e. knowing the start boundary and offset, how many characters are left before the resulting length reaches the fragment size). If this scan finds the start boundary, it is impossible to satisfy the problem restriction, and we take the first boundary following offset instead (or better, since we already scanned [offset, targetEndOffset], start from targetEndOffset + 1).
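The idea above can be sketched as follows, again with the JDK's `java.text.BreakIterator`. The names `BoundedFragmentSketch`, `fragment`, and `limit` are hypothetical, the exact resume position of the fallback scan is an assumption of this sketch, and the code is not the patch itself. It assumes `maxLen >= 1` and `0 <= offset <= text.length()`.

```java
import java.text.BreakIterator;
import java.util.Arrays;
import java.util.Locale;

// Hypothetical illustration of the backward-scan approach; not the Elasticsearch implementation.
final class BoundedFragmentSketch {

    /** Returns {start, end} of the fragment around offset. Assumes maxLen >= 1. */
    static int[] fragment(String text, int offset, int maxLen) {
        BreakIterator bi = BreakIterator.getSentenceInstance(Locale.ROOT);
        bi.setText(text);

        // Nearest boundary before the offset (DONE is -1, so clamp to the text start).
        int start = Math.max(bi.preceding(offset), 0);

        // First position at which end - start < maxLen would be violated.
        // Computed as a long so a very large maxLen cannot overflow.
        long limit = (long) start + maxLen;

        if (limit > text.length()) {
            // The rest of the text fits in the fragment.
            return new int[] { start, text.length() };
        }

        // Last boundary strictly before the limit: one call instead of a loop.
        int end = bi.preceding((int) limit);

        if (end <= offset) {
            // No boundary lies in (offset, limit), so the restriction cannot be met.
            // Fall back to the first boundary after the offset; since (offset, limit)
            // is known to be empty, the scan can resume just before the limit.
            int from = (int) Math.max(offset, limit - 1);
            end = bi.following(from);
            if (end == BreakIterator.DONE) {
                end = text.length();
            }
        }
        return new int[] { start, end };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(fragment("One. Two. Three. Four.", 6, 13)));
    }
}
```

In this sketch the number of `BreakIterator` calls per fragment is constant (at most three), regardless of how many sentences fit inside the fragment size.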

In theory, both approaches should produce exactly the same output given the same (text, offset, fragment_size) tuple. But they don't. As far as my investigation went, BreakIterator doesn't seem to be commutative: previous method calls affect later calls. So I guess any change we make to this algorithm may produce results that differ from the current ones. It is rare, but it happens. I would love to be wrong on this though.

The highlighting-related unit and integration tests pass. However, some tests that seem completely unrelated to this change are failing. See below.

romseygeek · 2022-08-02T15:36:31Z

@elasticmachine ok to test

elasticsearchmachine · 2022-08-02T15:36:32Z

Pinging @elastic/es-search (Team:Search)

romseygeek · 2022-08-02T15:37:04Z

@elasticmachine generate changelog

romseygeek · 2022-08-02T16:28:31Z

@elasticmachine update branch

pugnascotia · 2022-08-03T09:38:20Z

@elasticmachine generate changelog

romseygeek

LGTM, thanks @thelink2012

thelink2012 · 2022-08-04T20:46:57Z

Hey @romseygeek. Thanks for the merge. Would it be possible for this patch to be backported to previous versions (at least 8.4)?

javanna · 2022-08-09T14:57:08Z

heya @thelink2012 we only backport bug-fixes to patch releases, and this does not qualify as one.

thelink2012 added 5 commits July 11, 2022 20:09

Improve performance of BoundedBreakIteratorScanner

14f832e

Better compatibility between old and new algorithm

14c0787

Remove unreachable condition in uhighlight context

7f7822b

Fix uhighlight fragmentation algorithm breaking on large fragment_size values

cafdece

format

84b8b4c

elasticsearchmachine added the needs:triage, v8.5.0, and external-contributor labels Aug 2, 2022

thelink2012 mentioned this pull request Aug 2, 2022

Improve efficiency of BoundedBreakIteratorScanner fragmentation algorithm (the 2nd) #75306

Closed

romseygeek self-assigned this Aug 2, 2022

romseygeek added the >enhancement and :Search Relevance/Highlighting labels and removed the needs:triage label Aug 2, 2022

elasticsearchmachine added the Team:Search label Aug 2, 2022

Merge branch 'main' into denilson/improved-highlight-2

5094daf

pugnascotia assigned pugnascotia and unassigned romseygeek Aug 3, 2022

Add changelog

3fe32dd

pugnascotia assigned romseygeek and unassigned pugnascotia Aug 3, 2022

romseygeek approved these changes Aug 3, 2022

romseygeek merged commit 6bf5078 into elastic:main Aug 3, 2022

romseygeek mentioned this pull request Sep 23, 2022

Unified Highlighter way too slow for fragment_size > 0 #73569

Closed
