Docs: Document how to rebuild analyzers #30498

Merged: 5 commits, merged May 14, 2018
Changes from 3 commits
57 changes: 43 additions & 14 deletions docs/reference/analysis/analyzers/fingerprint-analyzer.asciidoc
@@ -9,20 +9,6 @@ Input text is lowercased, normalized to remove extended characters, sorted,
deduplicated and concatenated into a single token. If a stopword list is
configured, stop words will also be removed.

[float]
=== Definition

It consists of:

Tokenizer::
* <<analysis-standard-tokenizer,Standard Tokenizer>>

Token Filters (in order)::
1. <<analysis-lowercase-tokenfilter,Lower Case Token Filter>>
2. <<analysis-asciifolding-tokenfilter>>
3. <<analysis-stop-tokenfilter,Stop Token Filter>> (disabled by default)
4. <<analysis-fingerprint-tokenfilter>>

[float]
=== Example output

@@ -149,3 +135,46 @@ The above example produces the following term:
---------------------------
[ consistent godel said sentence yes ]
---------------------------

[float]
=== Definition

The `fingerprint` analyzer consists of:

Tokenizer::
* <<analysis-standard-tokenizer,Standard Tokenizer>>

Token Filters (in order)::
* <<analysis-lowercase-tokenfilter,Lower Case Token Filter>>
* <<analysis-asciifolding-tokenfilter>>
* <<analysis-stop-tokenfilter,Stop Token Filter>> (disabled by default)
* <<analysis-fingerprint-tokenfilter>>

If you need to customize the `fingerprint` analyzer beyond the configuration
parameters then you need to recreate it as a `custom` analyzer and modify
it, usually by adding token filters. This would recreate the built-in
`fingerprint` analyzer and you can use it as a starting point for further
customization:

[source,js]
----------------------------------------------------
PUT /fingerprint_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_fingerprint": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "asciifolding",
            "fingerprint"
          ]
        }
      }
    }
  }
}
----------------------------------------------------
// CONSOLE
// TEST[s/\n$/\nstartyaml\n - compare_analyzers: {index: fingerprint_example, first: fingerprint, second: rebuilt_fingerprint}\nendyaml\n/]
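
As a quick check (an illustrative request, assuming the `fingerprint_example`
index created above), you can run a sample sentence through the rebuilt analyzer
with the `_analyze` API; it should return the same single term as the built-in
`fingerprint` analyzer would:

[source,js]
----------------------------------------------------
GET /fingerprint_example/_analyze
{
  "analyzer": "rebuilt_fingerprint",
  "text": "Yes yes, Gödel said this sentence is consistent and."
}
----------------------------------------------------
// CONSOLE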
45 changes: 37 additions & 8 deletions docs/reference/analysis/analyzers/keyword-analyzer.asciidoc
@@ -4,14 +4,6 @@
The `keyword` analyzer is a ``noop'' analyzer which returns the entire input
string as a single token.

[float]
=== Definition

It consists of:

Tokenizer::
* <<analysis-keyword-tokenizer,Keyword Tokenizer>>

[float]
=== Example output

@@ -57,3 +49,40 @@ The above sentence would produce the following single term:
=== Configuration

The `keyword` analyzer is not configurable.

[float]
=== Definition

The `keyword` analyzer consists of:

Tokenizer::
* <<analysis-keyword-tokenizer,Keyword Tokenizer>>

If you need to customize the `keyword` analyzer then you need to
recreate it as a `custom` analyzer and modify it, usually by adding
token filters. Usually, you should prefer the
<<keyword, Keyword type>> when you want strings that are not split
into tokens, but just in case you need it, this would recreate the
built-in `keyword` analyzer and you can use it as a starting point
for further customization:

[source,js]
----------------------------------------------------
PUT /keyword_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_keyword": {
          "tokenizer": "keyword",
          "filter": [ <1>
          ]
        }
      }
    }
  }
}
----------------------------------------------------
// CONSOLE
// TEST[s/\n$/\nstartyaml\n - compare_analyzers: {index: keyword_example, first: keyword, second: rebuilt_keyword}\nendyaml\n/]
<1> You'd add any token filters here.
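
For example, one common tweak (purely illustrative; the index and analyzer names
below are placeholders) is to add a `lowercase` token filter so the whole string
is still kept as a single token but matched case-insensitively:

[source,js]
----------------------------------------------------
PUT /keyword_lowercase_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "keyword_lowercased": {
          "tokenizer": "keyword",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  }
}
----------------------------------------------------
// CONSOLE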
61 changes: 48 additions & 13 deletions docs/reference/analysis/analyzers/pattern-analyzer.asciidoc
@@ -19,19 +19,6 @@ Read more about http://www.regular-expressions.info/catastrophic.html[pathologic

========================================


[float]
=== Definition

It consists of:

Tokenizer::
* <<analysis-pattern-tokenizer,Pattern Tokenizer>>

Token Filters::
* <<analysis-lowercase-tokenfilter,Lower Case Token Filter>>
* <<analysis-stop-tokenfilter,Stop Token Filter>> (disabled by default)

[float]
=== Example output

@@ -378,3 +365,51 @@ The regex above is easier to understand as:
[\p{L}&&[^\p{Lu}]] # then lower case
)
--------------------------------------------------

[float]
=== Definition

The `pattern` analyzer consists of:

Tokenizer::
* <<analysis-pattern-tokenizer,Pattern Tokenizer>>

Token Filters::
* <<analysis-lowercase-tokenfilter,Lower Case Token Filter>>
* <<analysis-stop-tokenfilter,Stop Token Filter>> (disabled by default)

If you need to customize the `pattern` analyzer beyond the configuration
parameters then you need to recreate it as a `custom` analyzer and modify
it, usually by adding token filters. This would recreate the built-in
`pattern` analyzer and you can use it as a starting point for further
customization:

[source,js]
----------------------------------------------------
PUT /pattern_example
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "split_on_non_word": {
          "type": "pattern",
          "pattern": "\\W+" <1>
        }
      },
      "analyzer": {
        "rebuilt_pattern": {
          "tokenizer": "split_on_non_word",
          "filter": [
            "lowercase" <2>
          ]
        }
      }
    }
  }
}
----------------------------------------------------
// CONSOLE
// TEST[s/\n$/\nstartyaml\n - compare_analyzers: {index: pattern_example, first: pattern, second: rebuilt_pattern}\nendyaml\n/]
<1> The default pattern is `\W+` which splits on non-word characters
and this is where you'd change it.
<2> You'd add other token filters after `lowercase`.

[Review comment, Contributor] should it be pattern instead of stopwords?
[Member Author] Yes! Now I have to figure out how the tests passed when it was wrong....
[Member Author] I believe it is because \\W+ is the default. And because we don't complain if you pass extra stuff here.
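
To adapt the rebuilt analyzer you change the `pattern` on the tokenizer. As a
sketch (the index, tokenizer, and analyzer names below are placeholders), this
variant splits on commas instead of non-word characters:

[source,js]
----------------------------------------------------
PUT /pattern_csv_example
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "split_on_comma": {
          "type": "pattern",
          "pattern": ","
        }
      },
      "analyzer": {
        "rebuilt_csv_pattern": {
          "tokenizer": "split_on_comma",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  }
}
----------------------------------------------------
// CONSOLE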
42 changes: 34 additions & 8 deletions docs/reference/analysis/analyzers/simple-analyzer.asciidoc
@@ -4,14 +4,6 @@
The `simple` analyzer breaks text into terms whenever it encounters a
character which is not a letter. All terms are lower cased.

[float]
=== Definition

It consists of:

Tokenizer::
* <<analysis-lowercase-tokenizer,Lower Case Tokenizer>>

[float]
=== Example output

@@ -127,3 +119,37 @@ The above sentence would produce the following terms:
=== Configuration

The `simple` analyzer is not configurable.

[float]
=== Definition

The `simple` analyzer consists of:

Tokenizer::
* <<analysis-lowercase-tokenizer,Lower Case Tokenizer>>

If you need to customize the `simple` analyzer then you need to recreate
it as a `custom` analyzer and modify it, usually by adding token filters.
This would recreate the built-in `simple` analyzer and you can use it as
a starting point for further customization:

[source,js]
----------------------------------------------------
PUT /simple_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_simple": {
          "tokenizer": "lowercase",
          "filter": [ <1>
          ]
        }
      }
    }
  }
}
----------------------------------------------------
// CONSOLE
// TEST[s/\n$/\nstartyaml\n - compare_analyzers: {index: simple_example, first: simple, second: rebuilt_simple}\nendyaml\n/]
<1> You'd add any token filters here.
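
As an illustrative check (the request assumes the `simple_example` index created
above), you can pass a sample sentence through the rebuilt analyzer with the
`_analyze` API; it should produce the same lower-cased terms as the built-in
`simple` analyzer:

[source,js]
----------------------------------------------------
GET /simple_example/_analyze
{
  "analyzer": "rebuilt_simple",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
----------------------------------------------------
// CONSOLE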
54 changes: 41 additions & 13 deletions docs/reference/analysis/analyzers/standard-analyzer.asciidoc
@@ -7,19 +7,6 @@ Segmentation algorithm, as specified in
http://unicode.org/reports/tr29/[Unicode Standard Annex #29]) and works well
for most languages.

[float]
=== Definition

It consists of:

Tokenizer::
* <<analysis-standard-tokenizer,Standard Tokenizer>>

Token Filters::
* <<analysis-standard-tokenfilter,Standard Token Filter>>
* <<analysis-lowercase-tokenfilter,Lower Case Token Filter>>
* <<analysis-stop-tokenfilter,Stop Token Filter>> (disabled by default)

[float]
=== Example output

@@ -276,3 +263,44 @@ The above example produces the following terms:
---------------------------
[ 2, quick, brown, foxes, jumpe, d, over, lazy, dog's, bone ]
---------------------------

[float]
=== Definition

The `standard` analyzer consists of:

Tokenizer::
* <<analysis-standard-tokenizer,Standard Tokenizer>>

Token Filters::
* <<analysis-standard-tokenfilter,Standard Token Filter>>
* <<analysis-lowercase-tokenfilter,Lower Case Token Filter>>
* <<analysis-stop-tokenfilter,Stop Token Filter>> (disabled by default)

If you need to customize the `standard` analyzer beyond the configuration
parameters then you need to recreate it as a `custom` analyzer and modify
it, usually by adding token filters. This would recreate the built-in
`standard` analyzer and you can use it as a starting point:

[source,js]
----------------------------------------------------
PUT /standard_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_standard": {
          "tokenizer": "standard",
          "filter": [
            "standard",
            "lowercase" <1>
          ]
        }
      }
    }
  }
}
----------------------------------------------------
// CONSOLE
// TEST[s/\n$/\nstartyaml\n - compare_analyzers: {index: standard_example, first: standard, second: rebuilt_standard}\nendyaml\n/]
<1> You'd add any token filters after `lowercase`.
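
For example, one possible customization (just a sketch; the `asciifolding` filter
is not part of the built-in definition, and the names below are placeholders)
adds ASCII folding after `lowercase`:

[source,js]
----------------------------------------------------
PUT /standard_asciifolding_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "standard_asciifolding": {
          "tokenizer": "standard",
          "filter": [
            "standard",
            "lowercase",
            "asciifolding"
          ]
        }
      }
    }
  }
}
----------------------------------------------------
// CONSOLE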
58 changes: 47 additions & 11 deletions docs/reference/analysis/analyzers/stop-analyzer.asciidoc
@@ -5,17 +5,6 @@ The `stop` analyzer is the same as the <<analysis-simple-analyzer,`simple` analy
but adds support for removing stop words. It defaults to using the
`_english_` stop words.

[float]
=== Definition

It consists of:

Tokenizer::
* <<analysis-lowercase-tokenizer,Lower Case Tokenizer>>

Token filters::
* <<analysis-stop-tokenfilter,Stop Token Filter>>

[float]
=== Example output

@@ -239,3 +228,50 @@ The above example produces the following terms:
---------------------------
[ quick, brown, foxes, jumped, lazy, dog, s, bone ]
---------------------------

[float]
=== Definition

The `stop` analyzer consists of:

Tokenizer::
* <<analysis-lowercase-tokenizer,Lower Case Tokenizer>>

Token filters::
* <<analysis-stop-tokenfilter,Stop Token Filter>>

If you need to customize the `stop` analyzer beyond the configuration
parameters then you need to recreate it as a `custom` analyzer and modify
it, usually by adding token filters. This would recreate the built-in
`stop` analyzer and you can use it as a starting point for further
customization:

[source,js]
----------------------------------------------------
PUT /stop_example
{
  "settings": {
    "analysis": {
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_" <1>
        }
      },
      "analyzer": {
        "rebuilt_stop": {
          "tokenizer": "lowercase",
          "filter": [
            "english_stop" <2>
          ]
        }
      }
    }
  }
}
----------------------------------------------------
// CONSOLE
// TEST[s/\n$/\nstartyaml\n - compare_analyzers: {index: stop_example, first: stop, second: rebuilt_stop}\nendyaml\n/]
<1> The default stopwords can be overridden with the `stopwords`
or `stopwords_path` parameters.
<2> You'd add any token filters after `english_stop`.
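
For instance (an illustrative sketch; the index name, filter name, and word list
below are placeholders), you could swap the `_english_` list for your own stop
words:

[source,js]
----------------------------------------------------
PUT /stop_custom_example
{
  "settings": {
    "analysis": {
      "filter": {
        "my_stop": {
          "type": "stop",
          "stopwords": [ "and", "is", "the" ]
        }
      },
      "analyzer": {
        "rebuilt_stop_custom": {
          "tokenizer": "lowercase",
          "filter": [
            "my_stop"
          ]
        }
      }
    }
  }
}
----------------------------------------------------
// CONSOLE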