-
Notifications
You must be signed in to change notification settings - Fork 24.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Unified Highlighter way too slow for fragment_size > 0 #73569
Comments
The said code can be rewritten in such way that no loop is required, reducing the service time to the same level as of |
Pinging @elastic/es-search (Team:Search) |
That's a great finding @thelink2012 ! The current algorithm tries to expand the fragment greedily so there's lots of room for improvement. I like the approach in the patch, can you open a PR and explain a bit more the (possible) differences with your approach ? |
Hey @jimczi. Thanks for your reply. I have opened a PR with the changes and an explanation of the approach and possible differences in output to the current one. |
…ithm (#89041) As discussed in #73569 the current implementation is too slow in certain scenarios. The inefficient part of the code can be stated as the following problem: Given a text (getText()) and a position in this text (offset), find the sentence boundary before and after the offset, in such a way that the after boundary is maximal but respects end boundary - start boundary < fragment size. In case it's impossible to produce an after boundary that respects the said condition, use the nearest boundary following offset. The current approach begins by finding the nearest preceding and following boundaries, and expands the following boundary greedily while it respects the problem restriction. This is fine asymptotically, but BreakIterator which is used to find each boundary is sometimes expensive. This new approach maximizes the after boundary by scanning for the last boundary preceding the position that would cause the condition to be violated (i.e. knowing start boundary and offset, how many characters are left before resulting length is fragment size). If this scan finds the start boundary, it means it's impossible to satisfy the problem restriction, and we get the first boundary following offset instead (or better, since we already scanned [offset, targetEndOffset], start from targetEndOffset + 1).
Fixed by #89041 |
Elasticsearch version (
bin/elasticsearch --version
): 7.9.0 (also tested on latest: 7.13.0)Plugins installed: []
JVM version (
java -version
): 14.0.1OS version (
uname -a
if on a Unix-like system): Linux 5.8.0-53-generic 20.04.1-UbuntuDescription of the problem including expected versus actual behavior:
While investigating performance in our workload, we identified the unified highlighter as taking up to 95% of profiling samples for certain queries. Further investigation reveals this loop in
BoundedBreakIteratorScanner
as the culprit.This can be confirmed by passing
fragment_size: 0
to the highlighter such that it usesBreakIterator.getSentenceInstance()
unwrapped. Service time goes down to an acceptable level.Below we include a reproduction sample that (in my developer machine) can go from ~500ms to ~30ms by changing
fragment_size: 300
tofragment_size: 0
in the highlighting of a single document.Steps to reproduce:
A relatively big document is required for reproduction. You can find one here as
doc.json
. The attached archive also contains an automated reproduction script. Nevertheless, as follow are the manual steps:The
fragment_size: 0
query produces the following:The
fragment_size: 300
one produces:The text was updated successfully, but these errors were encountered: