[CI] Language analyzer docs failure #30557

romseygeek · 2018-05-14T07:45:33Z

These both reproduce:

REPRODUCE WITH: ./gradlew :docs:integTestRunner \
  -Dtests.seed=33404A59123B3635 \
  -Dtests.class=org.elasticsearch.smoketest.DocsClientYamlTestSuiteIT \
  -Dtests.method="test {yaml=reference/analysis/analyzers/lang-analyzer/line_1146}" \
  -Dtests.security.manager=true \
  -Dtests.locale=pt \
  -Dtests.timezone=Australia/Sydney

REPRODUCE WITH: ./gradlew :docs:integTestRunner \
  -Dtests.seed=33404A59123B3635 \
  -Dtests.class=org.elasticsearch.smoketest.DocsClientYamlTestSuiteIT \
  -Dtests.method="test {yaml=reference/analysis/analyzers/lang-analyzer/line_373}" \
  -Dtests.security.manager=true \
  -Dtests.locale=pt \
  -Dtests.timezone=Australia/Sydney


elasticmachine · 2018-05-14T07:45:34Z

Pinging @elastic/es-search-aggs

cbuescher · 2018-05-18T10:14:32Z

This looks very much related to the changes made in #29535. Other seeds seem to run fine. I'm digging into the details, but if @nik9000 has some ideas on how to debug this efficiently, I'd appreciate a hint.

cbuescher · 2018-05-18T10:32:32Z

The first failure is related to the Italian analyzer:

Failure at [reference/analysis/analyzers/lang-analyzer:1144]: text differs: italian was [𐒁𐒌𐒥𐒔] but rebuilt_italian was [d'e]. In utf8 those are
   > [f0 90 92 81 f0 90 92 8c f0 90 92 a5 f0 90 92 94] and
   > [64 27 65]

The second is the same input token, but relates to the Catalan analyzer:

Failure at [reference/analysis/analyzers/lang-analyzer:352]: text differs: catalan was [𐒁𐒌𐒥𐒔] but rebuilt_catalan was [d'e]. In utf8 those are
   > [f0 90 92 81 f0 90 92 8c f0 90 92 a5 f0 90 92 94] and
   > [64 27 65]

cbuescher · 2018-05-18T11:48:51Z

It's a bit tricky to debug this since the context is missing from the error, but I think I managed to isolate the part where the two analyzer outputs begin to differ. I can reproduce it in Kibana. I'm not sure whether this copy/paste preserves all the "hidden" characters that the test string contains, but anyway:

PUT /italian_example
{
  "settings": {
    "analysis": {
      "filter": {
        "italian_elision": {
          "type": "elision",
          "articles": [
                "c", "l", "all", "dall", "dell",
                "nell", "sull", "coll", "pell",
                "gl", "agl", "dagl", "degl", "negl",
                "sugl", "un", "m", "t", "s", "v", "d"
          ]
        },
        "italian_stop": {
          "type":       "stop",
          "stopwords":  "_italian_" 
        },
        "italian_keywords": {
          "type":       "keyword_marker",
          "keywords":   ["esempio"] 
        },
        "italian_stemmer": {
          "type":       "stemmer",
          "language":   "light_italian"
        }
      },
      "analyzer": {
        "rebuilt_italian": {
          "tokenizer":  "standard",
          "filter": [
            "italian_elision",
            "lowercase",
            "italian_stop",
            "italian_keywords",
            "italian_stemmer"
          ]
        }
      }
    }
  }
}

POST /test/_analyze
{
  "analyzer" : "italian",
  "text" : "𐅣 D'e* ᧴᧱᧡, ﹢﹪ 𐒁𐒌𐒥𐒔 ջ՘՘զԲ԰ԽՙԻՕԴՇֈ 𐱍𐰕𐰬𐰪𐰯𐰕𐰉𐰜𐰁𐰫𐱉𐰅𐰾 ꩪꩤ꩹ꩤꩼ ᜦᜦᜫᜦ᜹, 𐡏𐡚𐡗𐡖𐡂𐡞𐡒𐡑      𐡞𐡌𐡗𐡄𐡁𐡓, ᇷᄒᇐᇽ취ᄸ"
}

POST /italian_example/_analyze
{
  "analyzer" : "rebuilt_italian",
  "text" : "𐅣 D'e* ᧴᧱᧡, ﹢﹪ 𐒁𐒌𐒥𐒔 ջ՘՘զԲ԰ԽՙԻՕԴՇֈ 𐱍𐰕𐰬𐰪𐰯𐰕𐰉𐰜𐰁𐰫𐱉𐰅𐰾 ꩪꩤ꩹ꩤꩼ ᜦᜦᜫᜦ᜹, 𐡏𐡚𐡗𐡖𐡂𐡞𐡒𐡑      𐡞𐡌𐡗𐡄𐡁𐡓, ᇷᄒᇐᇽ취ᄸ"
}

The first analyzes to:

{
  "tokens": [
    {
      "token": "𐅣",
      "start_offset": 0,
      "end_offset": 2,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "𐒁𐒌𐒥𐒔",
      "start_offset": 16,
      "end_offset": 24,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "ջ",
      "start_offset": 25,
      "end_offset": 26,
      "type": "<ALPHANUM>",
      "position": 3
    }, [...]

The second analyzes to:

{
  "tokens": [
    {
      "token": "𐅣",
      "start_offset": 0,
      "end_offset": 2,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "d'e",
      "start_offset": 3,
      "end_offset": 6,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "𐒁𐒌𐒥𐒔",
      "start_offset": 16,
      "end_offset": 24,
      "type": "<ALPHANUM>",
      "position": 2
    }, [...]

So the original italian analyzer seems to swallow one more token. This part is surrounded by many characters that seem to get dropped during analysis, which also makes this hard to debug.
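
One way to narrow this down further might be to analyze only the suspect fragment with both analyzers, so that the surrounding dropped characters no longer obscure the result. The requests below are a sketch, assuming the italian_example index created above still exists; the shortened text value D'e is an assumed reduction of the test string, not something taken from the CI failure.

```json
# Built-in analyzer, no index needed
POST /_analyze
{
  "analyzer": "italian",
  "text": "D'e"
}

# Rebuilt analyzer defined in the italian_example index settings
POST /italian_example/_analyze
{
  "analyzer": "rebuilt_italian",
  "text": "D'e"
}
```

If the two responses still differ for this short input, the difference can be attributed to how the filter chain handles D'e rather than to any of the other scripts in the original test string.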

jimczi · 2018-05-18T11:59:03Z

I think it's caused by the elision filter, which is case insensitive in the built-in analyzer but not in the rebuilt one. Adding "articles_case": true to every elision filter in the rebuilt analyzers seems to solve the issue (this is already done for the french_rebuilt).
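
For the Italian example above, that change might look like the fragment below. This is only a sketch of the italian_elision filter definition, reusing the article list from the italian_example request earlier in the thread; the rest of the settings would stay as they are, and the other rebuilt analyzers that use an elision filter would presumably need the same flag.

```json
{
  "italian_elision": {
    "type": "elision",
    "articles": [
      "c", "l", "all", "dall", "dell",
      "nell", "sull", "coll", "pell",
      "gl", "agl", "dagl", "degl", "negl",
      "sugl", "un", "m", "t", "s", "v", "d"
    ],
    "articles_case": true
  }
}
```

The gap between positions 0 and 2 in the built-in analyzer's output is consistent with this diagnosis: once a case-insensitive elision strips the D' prefix, the remaining e is presumably removed by the _italian_ stop filter, whereas a case-sensitive filter leaves D'e intact so that it survives as d'e after lowercasing.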

This commit fixes docs failure on language analyzers when compared to the built in analyzers. The `elision` filters used by the rebuilt language analyzers should be case insensitive to match the definition of the prebuilt analyzers. Closes elastic#30557


romseygeek added the >docs, :Search Relevance/Analysis, and v6.3.0 labels May 14, 2018

romseygeek added the >test-failure label May 14, 2018

cbuescher self-assigned this May 18, 2018

jimczi mentioned this issue May 18, 2018: Fix docs failure on language analyzers #30722 (merged)

jimczi closed this as completed in #30722 May 22, 2018

javanna added the Team:Search Relevance label Jul 16, 2024
