Unified Highlighter way too slow for fragment_size > 0 #73569

thelink2012 · 2021-05-31T16:23:06Z

Elasticsearch version (bin/elasticsearch --version): 7.9.0 (also tested on latest: 7.13.0)

Plugins installed: []

JVM version (java -version): 14.0.1

OS version (uname -a if on a Unix-like system): Linux 5.8.0-53-generic 20.04.1-Ubuntu

Description of the problem including expected versus actual behavior:

While investigating performance in our workload, we identified the unified highlighter as taking up to 95% of profiling samples for certain queries. Further investigation reveals this loop in BoundedBreakIteratorScanner as the culprit.

This can be confirmed by passing fragment_size: 0 to the highlighter such that it uses BreakIterator.getSentenceInstance() unwrapped. Service time goes down to an acceptable level.

Below we include a reproduction sample that (in my developer machine) can go from ~500ms to ~30ms by changing fragment_size: 300 to fragment_size: 0 in the highlighting of a single document.

Steps to reproduce:

A relatively big document is required for reproduction. You can find one here as doc.json. The attached archive also contains an automated reproduction script. Nevertheless, as follow are the manual steps:

# 1) Create the index
curl -H "Content-Type: application/json" -XPUT "http://localhost:9200/sample-index" --data '
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "index_options": "offsets"
      }
    }
  }
}'

# 2) Populate with certain documents
curl -H "Content-Type: application/json" -XPOST "http://localhost:9200/sample-index/_doc" --data @doc.json
curl -XPOST "http://localhost:39200/sample-index/_refresh"

# 3) Highlight with fragment_size: 0 (should be fast)
curl -H "Content-Type: application/json" -XGET "http://localhost:9200/sample-index/_search?request_cache=false&_source=false" -d '
{
  "query": {
    "query_string": {
      "query": "ter pelo no ser para"
    }
  },
  "highlight": {
    "fields": {
      "content": {
	"number_of_fragments": 1,
	"fragment_size": 0,
	"type": "unified"
      }
    }
  }
}'

# 4) Highlight with fragment_size: 300 (slow)
curl -H "Content-Type: application/json" -XGET "http://localhost:9200/sample-index/_search?request_cache=false&_source=false" -d '
{
  "query": {
    "query_string": {
      "query": "ter pelo no ser para"
    }
  },
  "highlight": {
    "fields": {
      "content": {
	"number_of_fragments": 1,
	"fragment_size": 300,
	"type": "unified"
      }
    }
  }
}'

The fragment_size: 0 query produces the following:

{"took":38,"timed_out":false,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0},"hits":{"total":{"value":1,"relation":"eq"},"max_score":3.1488662,"hits":[{"_index":"sample-index","_type":"_doc","_id":"zv7iwnkBFh9839cGbK6f","_score":3.1488662,"highlight":{"content":["Se comprou e pagou tem que <em>ter</em> entrada <em>no</em> almoxarifado, seja <em>para</em> estoque ou <em>para</em> consumo."]}}]}}

The fragment_size: 300 one produces:

{"took":595,"timed_out":false,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0},"hits":{"total":{"value":1,"relation":"eq"},"max_score":3.1488662,"hits":[{"_index":"sample-index","_type":"_doc","_id":"zv7iwnkBFh9839cGbK6f","_score":3.1488662,"highlight":{"content":["E aí <em>no</em> caso das notas fiscais, eu não lembro quais os números, mas umas eram só <em>pelo</em> Secretário e outras <em>pelo</em> (...); aí eu até questionei aí, porque poderia também <em>ser</em> assinada às vezes, <em>ter</em> sido nomeado um gestor do contrato dessa adesão."]}}]}}

The text was updated successfully, but these errors were encountered:

thelink2012 · 2021-05-31T16:23:23Z

The said code can be rewritten in such way that no loop is required, reducing the service time to the same level as of fragment_size: 0. We'd be glad to submit this patch upstream if the issue gets considered as valid.

elasticmachine · 2021-06-02T10:12:28Z

Pinging @elastic/es-search (Team:Search)

jimczi · 2021-06-02T11:23:42Z

That's a great finding @thelink2012 ! The current algorithm tries to expand the fragment greedily so there's lots of room for improvement. I like the approach in the patch, can you open a PR and explain a bit more the (possible) differences with your approach ?

thelink2012 · 2021-06-04T19:22:43Z

Hey @jimczi. Thanks for your reply. I have opened a PR with the changes and an explanation of the approach and possible differences in output to the current one.

…ithm (#89041) As discussed in #73569 the current implementation is too slow in certain scenarios. The inefficient part of the code can be stated as the following problem: Given a text (getText()) and a position in this text (offset), find the sentence boundary before and after the offset, in such a way that the after boundary is maximal but respects end boundary - start boundary < fragment size. In case it's impossible to produce an after boundary that respects the said condition, use the nearest boundary following offset. The current approach begins by finding the nearest preceding and following boundaries, and expands the following boundary greedily while it respects the problem restriction. This is fine asymptotically, but BreakIterator which is used to find each boundary is sometimes expensive. This new approach maximizes the after boundary by scanning for the last boundary preceding the position that would cause the condition to be violated (i.e. knowing start boundary and offset, how many characters are left before resulting length is fragment size). If this scan finds the start boundary, it means it's impossible to satisfy the problem restriction, and we get the first boundary following offset instead (or better, since we already scanned [offset, targetEndOffset], start from targetEndOffset + 1).

romseygeek · 2022-09-23T12:39:32Z

Fixed by #89041

thelink2012 added >bug needs:triage Requires assignment of a team area label labels May 31, 2021

iverase added :Search/Search Search-related issues that do not fall into other categories and removed needs:triage Requires assignment of a team area label labels Jun 2, 2021

elasticmachine added the Team:Search Meta label for search team label Jun 2, 2021

iverase added :Search Relevance/Highlighting How a query matched a document and removed :Search/Search Search-related issues that do not fall into other categories labels Jun 2, 2021

thelink2012 mentioned this issue Jun 4, 2021

Improve efficiency of BoundedBreakIteratorScanner fragmentation algorithm #73785

Merged

thelink2012 mentioned this issue Jul 13, 2021

Improve efficiency of BoundedBreakIteratorScanner fragmentation algorithm (the 2nd) #75306

Closed

thelink2012 mentioned this issue Aug 2, 2022

Improve efficiency of BoundedBreakIteratorScanner fragmentation algorithm #89041

Merged

romseygeek closed this as completed Sep 23, 2022

javanna added Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch and removed Team:Search Meta label for search team labels Jul 12, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unified Highlighter way too slow for fragment_size > 0 #73569

Unified Highlighter way too slow for fragment_size > 0 #73569

thelink2012 commented May 31, 2021

thelink2012 commented May 31, 2021 •

edited

Loading

elasticmachine commented Jun 2, 2021

jimczi commented Jun 2, 2021

thelink2012 commented Jun 4, 2021

romseygeek commented Sep 23, 2022

Unified Highlighter way too slow for fragment_size > 0 #73569

Unified Highlighter way too slow for fragment_size > 0 #73569

Comments

thelink2012 commented May 31, 2021

thelink2012 commented May 31, 2021 • edited Loading

elasticmachine commented Jun 2, 2021

jimczi commented Jun 2, 2021

thelink2012 commented Jun 4, 2021

romseygeek commented Sep 23, 2022

thelink2012 commented May 31, 2021 •

edited

Loading