Docs: Document how to rebuild analyzers #30498
Changes from 3 commits
@@ -19,19 +19,6 @@ Read more about http://www.regular-expressions.info/catastrophic.html[pathologic
========================================

[float]
=== Definition

It consists of:

Tokenizer::
* <<analysis-pattern-tokenizer,Pattern Tokenizer>>

Token Filters::
* <<analysis-lowercase-tokenfilter,Lower Case Token Filter>>
* <<analysis-stop-tokenfilter,Stop Token Filter>> (disabled by default)

[float]
=== Example output

@@ -378,3 +365,51 @@ The regex above is easier to understand as:
  [\p{L}&&[^\p{Lu}]] # then lower case
)
--------------------------------------------------

[float]
=== Definition

The `pattern` analyzer consists of:

Tokenizer::
* <<analysis-pattern-tokenizer,Pattern Tokenizer>>

Token Filters::
* <<analysis-lowercase-tokenfilter,Lower Case Token Filter>>
* <<analysis-stop-tokenfilter,Stop Token Filter>> (disabled by default)

If you need to customize the `pattern` analyzer beyond the configuration
parameters then you need to recreate it as a `custom` analyzer and modify
it, usually by adding token filters. This would recreate the built in
`pattern` analyzer and you can use it as a starting point for further
customization:

Review comment: "built in" -> "built-in"
[source,js]
----------------------------------------------------
PUT /pattern_example
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "split_on_non_word": {
          "type": "pattern",
          "stopwords": "\\W+" <1>
        }
      },
      "analyzer": {
        "rebuilt_pattern": {
          "tokenizer": "split_on_non_word",
          "filter": [
            "lowercase" <2>
          ]
        }
      }
    }
  }
}
----------------------------------------------------
// CONSOLE
// TEST[s/\n$/\nstartyaml\n - compare_analyzers: {index: pattern_example, first: pattern, second: rebuilt_pattern}\nendyaml\n/]
<1> The default pattern is `\W+` which splits on non-word characters
and this is where you'd change it.
<2> You'd add other token filters after `lowercase`.

Review comment (on the `"stopwords": "\\W+"` line): should it be ... ?
Reply: Yes! Now I have to figure out how the tests passed when it was wrong....
Reply: I believe it is because ...
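As an aside for reviewers trying the snippet out locally: a rebuilt analyzer like this can be spot-checked against the built-in one with the `_analyze` API. The request below is a sketch, not part of this change, and assumes the `pattern_example` index from the snippet above has been created:

[source,js]
----------------------------------------------------
GET /pattern_example/_analyze
{
  "analyzer": "rebuilt_pattern",
  "text": "The QUICK brown fox!"
}
----------------------------------------------------

If the rebuild is faithful, `rebuilt_pattern` and the built-in `pattern` analyzer should produce the same tokens for the same text, here `[the, quick, brown, fox]` — which is essentially what the `compare_analyzers` test assertion above automates.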
@@ -4,14 +4,6 @@
The `simple` analyzer breaks text into terms whenever it encounters a
character which is not a letter. All terms are lower cased.

[float]
=== Definition

It consists of:

Tokenizer::
* <<analysis-lowercase-tokenizer,Lower Case Tokenizer>>

[float]
=== Example output

@@ -127,3 +119,37 @@ The above sentence would produce the following terms:
=== Configuration

The `simple` analyzer is not configurable.

[float]
=== Definition

The `simple` anlzyer consists of:

Review comment: "anlzyer" -> "analyzer"

Tokenizer::
* <<analysis-lowercase-tokenizer,Lower Case Tokenizer>>

If you need to customize the `simple` analyzer then you need to recreate
it as a `custom` analyzer and modify it, usually by adding token filters.
This would recreate the built-in `simple` analyzer and you can use it as
a starting point for further customization:
[source,js]
----------------------------------------------------
PUT /simple_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_simple": {
          "tokenizer": "lowercase",
          "filter": [ <1>
          ]
        }
      }
    }
  }
}
----------------------------------------------------
// CONSOLE
// TEST[s/\n$/\nstartyaml\n - compare_analyzers: {index: simple_example, first: simple, second: rebuilt_simple}\nendyaml\n/]
<1> You'd add any token filters here.
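The same local spot-check works here. This request is a sketch, not part of this change, and assumes the `simple_example` index from the snippet above exists; since the `lowercase` tokenizer splits on any non-letter and lower-cases each term, both `rebuilt_simple` and the built-in `simple` analyzer should tokenize the text the same way:

[source,js]
----------------------------------------------------
GET /simple_example/_analyze
{
  "analyzer": "rebuilt_simple",
  "text": "It's 2 GOOD to be true"
}
----------------------------------------------------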
Review comment: "this his" -> "this" ?