Docs for synthetic source #87416

Merged
merged 29 commits into from
Jun 9, 2022
Changes from 3 commits
Commits
b614bf2
Docs for synthetic source
nik9000 Jun 6, 2022
5d49927
numeric
nik9000 Jun 6, 2022
3653c84
Remaining fields
nik9000 Jun 6, 2022
c27bab7
Update docs/reference/mapping/fields/source-field.asciidoc
nik9000 Jun 7, 2022
241a1f8
Update docs/reference/mapping/fields/synthetic-source.asciidoc
nik9000 Jun 7, 2022
aaf468f
Update docs/reference/mapping/fields/synthetic-source.asciidoc
nik9000 Jun 7, 2022
e9ce9a9
Update docs/reference/mapping/fields/synthetic-source.asciidoc
nik9000 Jun 7, 2022
cafa3cd
Update docs/reference/mapping/fields/synthetic-source.asciidoc
nik9000 Jun 7, 2022
90e8cd0
Update docs/reference/mapping/fields/synthetic-source.asciidoc
nik9000 Jun 7, 2022
bee3176
Update docs/reference/mapping/fields/synthetic-source.asciidoc
nik9000 Jun 7, 2022
76093c0
Update docs/reference/mapping/fields/synthetic-source.asciidoc
nik9000 Jun 7, 2022
4b1f802
Update docs/reference/mapping/fields/synthetic-source.asciidoc
nik9000 Jun 7, 2022
2579a01
Update docs/reference/mapping/types/boolean.asciidoc
nik9000 Jun 7, 2022
ef68f5d
Update docs/reference/mapping/types/geo-point.asciidoc
nik9000 Jun 7, 2022
67d476f
Update docs/reference/mapping/types/ip.asciidoc
nik9000 Jun 7, 2022
9398bf7
Update docs/reference/mapping/types/keyword.asciidoc
nik9000 Jun 7, 2022
3be7119
Update docs/reference/mapping/types/numeric.asciidoc
nik9000 Jun 7, 2022
801ddf5
Update docs/reference/mapping/types/text.asciidoc
nik9000 Jun 7, 2022
35936a4
Merge branch 'master' into synthetic_source_docs_1
nik9000 Jun 7, 2022
28cf6cf
Note on position increment gap
nik9000 Jun 7, 2022
955d73f
as specified by
nik9000 Jun 7, 2022
5962c27
Note
nik9000 Jun 7, 2022
95a384d
Sneaky
nik9000 Jun 7, 2022
ca4dde7
Fine, I'm convinced
nik9000 Jun 7, 2022
9b938e2
Update docs/reference/mapping/types/text.asciidoc
nik9000 Jun 7, 2022
b8eaa26
Merge branch 'master' into synthetic_source_docs_1
nik9000 Jun 8, 2022
812c170
Move ordering bit
nik9000 Jun 8, 2022
d99b6b4
Merge branch 'master' into synthetic_source_docs_1
nik9000 Jun 9, 2022
9664f05
Update
nik9000 Jun 9, 2022
5 changes: 5 additions & 0 deletions docs/reference/mapping/fields/source-field.asciidoc
@@ -6,6 +6,11 @@ at index time. The `_source` field itself is not indexed (and thus is not
searchable), but it is stored so that it can be returned when executing
_fetch_ requests, like <<docs-get,get>> or <<search-search,search>>.

ifeval::["{release-state}"=="unreleased"]
include::synthetic-source.asciidoc[]
endif::[]


[[disable-source-field]]
==== Disabling the `_source` field

161 changes: 161 additions & 0 deletions docs/reference/mapping/fields/synthetic-source.asciidoc
@@ -0,0 +1,161 @@
[[synthetic-source]]
==== Synthetic `_source`

Though very handy to have around, the source field takes up a fair amount of
space on disk. Instead of storing it on disk precisely as you sent it, Elasticsearch
can reconstruct it on the fly. Enable this by setting `synthetic: true` in `_source`:

[source,console,id=enable-synthetic-source-example]
----
PUT idx
{
"mappings": {
"_source": {
"synthetic": true
}
}
}
----
// TESTSETUP

This on-the-fly reconstruction is *generally* slower than saving the source
precisely and loading it, but it saves a lot of space. It also
<<synthetic-source-modifications,modifies>> the `_source` and is only supported
if the index is entirely made up of the following field types:

* <<boolean-synthetic-source,`boolean`>>
* <<numeric-synthetic-source,`byte`>>
* <<numeric-synthetic-source,`double`>>
* <<numeric-synthetic-source,`float`>>
* <<geo-point-synthetic-source,`geo_point`>>
* <<numeric-synthetic-source,`half_float`>>
* <<numeric-synthetic-source,`integer`>>
* <<ip-synthetic-source,`ip`>>
* <<keyword-synthetic-source,`keyword`>>
* <<numeric-synthetic-source,`long`>>
* <<numeric-synthetic-source,`scaled_float`>>
* <<numeric-synthetic-source,`short`>>
* <<text-synthetic-source,`text`>>
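A quick way to see whether an existing mapping stays inside this list is to walk its `properties`. The following is a minimal sketch in Python, not an Elasticsearch API: `unsupported_fields` is a hypothetical helper, and it only checks field types, not per-field options such as `copy_to` or `doc_values`.

```python
# Field types listed above as supporting synthetic _source.
SUPPORTED_TYPES = {
    "boolean", "byte", "double", "float", "geo_point", "half_float",
    "integer", "ip", "keyword", "long", "scaled_float", "short", "text",
}


def unsupported_fields(properties, prefix=""):
    """Return dotted paths of fields whose type is not in SUPPORTED_TYPES.

    `properties` is the "properties" object of a mapping. Fields without an
    explicit "type" are treated as plain objects and walked recursively.
    """
    problems = []
    for name, field in properties.items():
        path = prefix + name
        field_type = field.get("type", "object")
        if field_type == "object":
            problems.extend(
                unsupported_fields(field.get("properties", {}), path + ".")
            )
        elif field_type not in SUPPORTED_TYPES:
            problems.append(path)
    return problems


mapping = {
    "kwd": {"type": "keyword"},
    "obj": {"properties": {"n": {"type": "long"}, "shape": {"type": "geo_shape"}}},
}
print(unsupported_fields(mapping))  # ['obj.shape']
```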

[[synthetic-source-modifications]]
===== Synthetic source modifications

[[synthetic-source-modifications-alphabetical]]
====== Sorts fields alphabetically
Synthetic source will sort all fields alphabetically, so:

[source,console,id=synthetic-source-sorted-example]
----
PUT idx/_doc/1
{
"foo": 1,
"bar": 2,
"baz": 3
}
----
// TEST[s/$/\nGET idx\/_doc\/1?filter_path=_source\n/]

Will become:

[source,console-result]
----
{
"bar": 2,
"baz": 3,
"foo": 1
}
Member

reading this, as much as I understand why you mention it, keys ordering is never a guarantee in JSON land. I wonder if it's then needed to provide an example. Could we shorten it and say "You should never rely on keys ordering, but if you do beware you'll lose that with synthetic source as what you'll get back is not 100% what you sent"?

Member Author

I get that. I'm providing an example for everything so I like the.... balance? I don't know the right word. I could link to the spec and mention that ordering isn't supported by the spec.

Member

I don't have a strong opinion but while the examples with arrays ought to be mentioned, taking this one out would make the docs page a little more compact without losing much?

Member Author

We need a tie breaker. @romseygeek, break a tie!

Member

No pressure :)

Contributor

I'm with @javanna, we don't guarantee key ordering anywhere so I don't think an example is necessary here.

Member Author

Democracy wins. I'll remove the example. I'll reduce this to a note.

----
// TEST[s/^/{"_source":/ s/\n$/}/]

[[synthetic-source-modifications-leaf-arrays]]
====== Moves arrays to leaf fields
Synthetic source will move all arrays to leaves so:

[source,console,id=synthetic-source-leaf-arrays-example]
----
PUT idx/_doc/1
{
"foo": [
{
"bar": 1
},
{
"bar": 2
}
]
}
----
// TEST[s/$/\nGET idx\/_doc\/1?filter_path=_source\n/]

Will become:

[source,console-result]
----
{
"foo": {
"bar": [1, 2]
}
}
----
// TEST[s/^/{"_source":/ s/\n$/}/]
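The transformation can be pictured as collecting every leaf value under its field path and then rebuilding the objects. The following Python sketch illustrates the idea only; it is not how Elasticsearch implements it. `move_arrays_to_leaves` is a hypothetical name, the sketch assumes each path is either an object or a leaf (never both), and it leaves per-type value sorting out.

```python
from collections import defaultdict


def _collect_leaves(value, path, leaves):
    """Record every scalar under the tuple of keys that leads to it."""
    if isinstance(value, dict):
        for key, child in value.items():
            _collect_leaves(child, path + (key,), leaves)
    elif isinstance(value, list):
        # Arrays do not add a path segment, so their items share one path.
        for item in value:
            _collect_leaves(item, path, leaves)
    else:
        leaves[path].append(value)


def move_arrays_to_leaves(doc):
    """Rebuild `doc` so that arrays only appear on leaf fields."""
    leaves = defaultdict(list)
    _collect_leaves(doc, (), leaves)
    result = {}
    for path in sorted(leaves):
        node = result
        for key in path[:-1]:
            node = node.setdefault(key, {})
        values = leaves[path]
        # A single value comes back as a scalar, several as an array.
        node[path[-1]] = values[0] if len(values) == 1 else values
    return result


print(move_arrays_to_leaves({"foo": [{"bar": 1}, {"bar": 2}]}))
# {'foo': {'bar': [1, 2]}}
```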

[[synthetic-source-modifications-field-names]]
====== Names fields as they are named in the mapping
Synthetic source names fields as they are named in the mapping. By default,
<<dynamic,dynamic mapping>> interprets fields with dots in their names as
objects, so it makes documents as "objecty" as possible:

[source,console,id=synthetic-source-objecty-example]
----
PUT idx/_doc/1
{
"foo.bar.baz": 1
}
----
// TEST[s/$/\nGET idx\/_doc\/1?filter_path=_source\n/]

Will become:

[source,console-result]
----
{
"foo": {
"bar": {
"baz": 1
}
}
}
----
// TEST[s/^/{"_source":/ s/\n$/}/]
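As a rough illustration of that naming rule, the Python sketch below expands dotted names into objects and leaves them alone when told that subobjects are disabled. `expand_dotted_names` and its `subobjects` parameter are hypothetical names chosen for this sketch; the real behaviour is driven by the mapping, not by the document.

```python
def _merge(target, doc):
    """Copy `doc` into `target`, turning each dot in a name into an object level."""
    for name, value in doc.items():
        *parents, leaf = name.split(".")
        node = target
        for key in parents:
            node = node.setdefault(key, {})
        if isinstance(value, dict):
            _merge(node.setdefault(leaf, {}), value)
        else:
            node[leaf] = value


def expand_dotted_names(doc, subobjects=True):
    """Return `doc` shaped the way an object-based mapping would name it."""
    if not subobjects:
        # Dots stay part of the field name.
        return dict(doc)
    result = {}
    _merge(result, doc)
    return result


print(expand_dotted_names({"foo.bar.baz": 1}))
# {'foo': {'bar': {'baz': 1}}}
print(expand_dotted_names({"foo.bar.baz": 1}, subobjects=False))
# {'foo.bar.baz': 1}
```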

You can keep dots in the names by setting <<subobjects>> to `false`, so:

[source,console,id=synthetic-dot-example]
----
PUT idx
{
"mappings": {
"subobjects": false,
"_source": {
"synthetic": true
}
}
}

PUT idx/_doc/1
{
"foo.bar.baz": 1
}
----
// TEST[s/^/DELETE idx\n/]
// TEST[s/$/\nGET idx\/_doc\/1?filter_path=_source\n/]

Will stay:

[source,console-result]
----
{
"foo.bar.baz": 1
}
----
Member

I wonder what is specific of synthetic source about how fields are mapped: isn't this all about dynamic mappings? Did you mean to say that synthetic _source recreate the objects structure or not depending on how fields are mapped?

Member Author

It recreates the structure precisely as the objects are mapped. At first I tried to explicitly create a field mapped as foo.bar.baz but the mapping infrastructure unraveled it so I went with this. I'll make an example, one moment.

Member Author

Yeah, check this out:

$ curl -uelastic:password -XDELETE localhost:9200/test
$ curl -uelastic:password -XPUT -HContent-Type:application/json localhost:9200/test -d'{
  "mappings": {
    "properties": {
      "foo.bar.baz": {
        "type": "keyword"
      }
    }
  }
}'
$ curl -uelastic:password localhost:9200/test/_mappings?pretty
{"acknowledged":true}{"acknowledged":true,"shards_acknowledged":true,"index":"test"}{
  "test" : {
    "mappings" : {
      "properties" : {
        "foo" : {
          "properties" : {
            "bar" : {
              "properties" : {
                "baz" : {
                  "type" : "keyword"
                }
              }
            }
          }
        }
      }
    }
  }
}

Compare:

$ curl -uelastic:password -XDELETE localhost:9200/test
$ curl -uelastic:password -XPUT -HContent-Type:application/json localhost:9200/test -d'{
  "mappings": {
    "subobjects": false,
    "properties": {
      "foo.bar.baz": {
        "type": "keyword"
      }
    }
  }
}'
$ curl -uelastic:password localhost:9200/test/_mappings?pretty
{"acknowledged":true}{"acknowledged":true,"shards_acknowledged":true,"index":"test"}{
  "test" : {
    "mappings" : {
      "subobjects" : false,
      "properties" : {
        "foo.bar.baz" : {
          "type" : "keyword"
        }
      }
    }
  }
}

So I wrote the example this way because it was the shortest way to get objects with dots in the name. But maybe it's unclear?

Member

Ok, I guess I was wondering why this needs to be specifically explained, isn't it the expected behaviour? :)

Member Author

I figured it'd be useful to explain "this does the right thing with that thing Luca just built". But maybe it's not worth it because it's not something folks want much?

Member

@javanna javanna Jun 7, 2022

Scratch that, maybe this is not so obvious :)
One thing to mention could be that you always recreate the object structure despite dots in fields names were provided in the first place (unless subobjects are disabled). Why not always flatten by the way? After all, there are two ways to provide that document and that leads to the same mapping, you have 50% chances to pick the variant that was sent in the first place.

By the way we may start accepting docs with subobjects even when subobjects are disabled and treat them like dotted fields, but I think the current behaviour of "flattening" would still be good in that case.

Member Author

I've removed the section

Member Author

Oh no! I removed the section!

Why not always flatten by the way?

We asked some kibana friends and they said folks liked things shaped this way instead of flattened. The objects feel right to folks. Also, I figured if we were guessing anyway the "more objecty" approach was probably more likely to match what users sent. I figured flattened fields with dots in them is more rare.

Member

fine with me. I think you removed only the section with subobjects false and you could make the remaining section about recreating objects shorter by saying that you follow the mappings structure hence prefer nested objects over dots in fields names but the opposite when subobjects are disabled.

Member Author

I think you removed only the section with subobjects false and you could make the remaining section about recreating objects shorter by saying that you follow the mappings structure hence prefer nested objects over dots in fields names but the opposite when subobjects are disabled.

Two things:

  1. Don't use the word nested unless you mean it. I get scared.
  2. Do you want me to change the words above? Could you suggest a change? I think the "dynamic mappings work like BLAH" bit is useful because it explains where those objects come from. I don't mean to explain all of dynamic mappings, just enough to give folks a hint.

// TEST[s/^/{"_source":/ s/\n$/}/]
36 changes: 36 additions & 0 deletions docs/reference/mapping/types/boolean.asciidoc
@@ -214,3 +214,39 @@ The following parameters are accepted by `boolean` fields:
<<mapping-field-meta,`meta`>>::

Metadata about the field.

ifeval::["{release-state}"=="unreleased"]
[[boolean-synthetic-source]]
==== Synthetic source
`boolean` fields support <<synthetic-source,synthetic `_source`>> in their
default configuration. But they don't support synthetic `_source` if you add
<<copy-to,`copy_to`>> or if you disable <<doc-values,`doc_values`>>.

Synthetic source will always sort `boolean` fields so:
[source,console,id=synthetic-source-boolean-example]
----
PUT idx
{
"mappings": {
"_source": { "synthetic": true },
"properties": {
"bool": { "type": "boolean" }
}
}
}
PUT idx/_doc/1
{
"bool": [true, false, true, false]
}
----
// TEST[s/$/\nGET idx\/_doc\/1?filter_path=_source\n/]

Will become:
[source,console-result]
----
{
"bool": [false, false, true, true]
}
----
// TEST[s/^/{"_source":/ s/\n$/}/]
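In Python terms the effect on a `boolean` array is roughly a plain sort that keeps duplicates, with `false` ordered before `true`. This is only a sketch of the visible result; `synthetic_booleans` is a hypothetical name.

```python
def synthetic_booleans(values):
    """Sort booleans with False first; duplicates are kept."""
    return sorted(values)


print(synthetic_booleans([True, False, True, False]))
# [False, False, True, True]
```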
endif::[]
44 changes: 44 additions & 0 deletions docs/reference/mapping/types/geo-point.asciidoc
@@ -203,3 +203,47 @@ For performance reasons, it is better to access the lat/lon values directly:
def lat = doc['location'].lat;
def lon = doc['location'].lon;
--------------------------------------------------

ifeval::["{release-state}"=="unreleased"]
[[geo-point-synthetic-source]]
==== Synthetic source
`geo_point` fields support <<synthetic-source,synthetic `_source`>> in their
default configuration. But they don't support synthetic `_source` if you add
<<ignore-malformed,`ignore_malformed`>>, add <<copy-to,`copy_to`>>, or if you
disable <<doc-values,`doc_values`>>.

Synthetic source will always sort `geo_point` fields (by latitude then longitude)
and reduce them to their stored precision. So:
[source,console,id=synthetic-source-geo-point-example]
----
PUT idx
{
"mappings": {
"_source": { "synthetic": true },
"properties": {
"point": { "type": "geo_point" }
}
}
}
PUT idx/_doc/1
{
"point": [
{"lat":-90, "lon":-80},
{"lat":10, "lon":30}
]
}
----
// TEST[s/$/\nGET idx\/_doc\/1?filter_path=_source\n/]

Will become:
[source,console-result]
----
{
"point": [
{"lat":-90.0, "lon":-80.00000000931323},
{"lat":9.999999990686774, "lon":29.999999972060323}
]
}
----
// TEST[s/^/{"_source":/ s/\n$/}/]
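The odd-looking decimals come from storing each coordinate at a fixed precision. The Python sketch below reproduces the effect under one assumption that is not stated in the text above: that latitude and longitude are each quantized to a 32-bit integer grid (steps of `180 / 2^32` and `360 / 2^32` degrees) and rounded down. All names in it are hypothetical.

```python
import math

# Assumed grid: 2^32 steps across the latitude and longitude ranges.
LAT_STEP = 180.0 / (1 << 32)
LON_STEP = 360.0 / (1 << 32)
MAX_ENCODED = (1 << 31) - 1  # largest value a signed 32-bit integer can hold


def quantize(value, step):
    """Round `value` down to the grid and convert it back to degrees."""
    encoded = min(math.floor(value / step), MAX_ENCODED)
    return encoded * step


def synthetic_geo_points(points):
    """Reduce points to the assumed stored precision, then sort by lat, lon."""
    stored = [
        (quantize(p["lat"], LAT_STEP), quantize(p["lon"], LON_STEP))
        for p in points
    ]
    return [{"lat": lat, "lon": lon} for lat, lon in sorted(stored)]


for point in synthetic_geo_points([{"lat": 10, "lon": 30}, {"lat": -90, "lon": -80}]):
    print(point)
```

Under this assumption `-90` lands exactly on the grid and comes back unchanged, while `10`, `30` and `-80` fall between grid points and come back slightly lower than they were sent.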
endif::[]
43 changes: 43 additions & 0 deletions docs/reference/mapping/types/ip.asciidoc
@@ -156,3 +156,46 @@ GET my-index-000001/_search
}
}
--------------------------------------------------

ifeval::["{release-state}"=="unreleased"]
[[ip-synthetic-source]]
==== Synthetic source
`ip` fields support <<synthetic-source,synthetic `_source`>> in their
default configuration. But they don't support synthetic `_source` if you add
<<ignore-malformed,`ignore_malformed`>>, <<copy-to,`copy_to`>>, or if you
disable <<doc-values,`doc_values`>>.

Synthetic source will always sort `ip` fields and remove duplicates so:
[source,console,id=synthetic-source-ip-example]
----
PUT idx
{
"mappings": {
"_source": { "synthetic": true },
"properties": {
"ip": { "type": "ip" }
}
}
}
PUT idx/_doc/1
{
"ip": ["192.168.0.1", "192.168.0.1", "10.10.12.123",
"2001:db8::1:0:0:1", "::afff:4567:890a"]
}
----
// TEST[s/$/\nGET idx\/_doc\/1?filter_path=_source\n/]

Will become:

[source,console-result]
----
{
"ip": ["::afff:4567:890a", "10.10.12.123", "192.168.0.1", "2001:db8::1:0:0:1"]
}
----
// TEST[s/^/{"_source":/ s/\n$/}/]

NOTE: IPv4 addresses are sorted as though they were IPv6 addresses prefixed by
`::ffff:0:0:0/96` as specified by
https://datatracker.ietf.org/doc/html/rfc6144[rfc6144].
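To see why `::afff:4567:890a` sorts ahead of `10.10.12.123`, it helps to compare the addresses as 128-bit numbers. The Python sketch below does that with the standard `ipaddress` module. It places IPv4 addresses in the IPv4-mapped range `::ffff:a.b.c.d`; that exact bit layout is an assumption made for illustration, and for the sample input the ordering matches the result shown above. `synthetic_ips` is a hypothetical name.

```python
import ipaddress

# Assumed layout: an IPv4 address a.b.c.d is compared as ::ffff:a.b.c.d.
IPV4_MAPPED_BASE = 0xFFFF << 32


def sort_key(text):
    """Return the 128-bit integer used to order and deduplicate an address."""
    address = ipaddress.ip_address(text)
    if address.version == 4:
        return IPV4_MAPPED_BASE + int(address)
    return int(address)


def synthetic_ips(values):
    """Sort addresses numerically and drop duplicates."""
    unique = {sort_key(value): value for value in values}
    return [unique[key] for key in sorted(unique)]


print(synthetic_ips([
    "192.168.0.1", "192.168.0.1", "10.10.12.123",
    "2001:db8::1:0:0:1", "::afff:4567:890a",
]))
# ['::afff:4567:890a', '10.10.12.123', '192.168.0.1', '2001:db8::1:0:0:1']
```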
Contributor

Suggested change
NOTE: IPv4 addresses are sorted as though they were IPv6 addresses prefixed by
`::ffff:0:0:0/96` ala https://datatracker.ietf.org/doc/html/rfc6144[rfc6144].
NOTE: IPv4 addresses are sorted as though they were IPv6 addresses prefixed by
`::ffff:0:0:0/96`, following
https://datatracker.ietf.org/doc/html/rfc6144[rfc6144] guidelines.

Member Author

I'm not sure I'd call this a "guideline". Maybe I should, but I think of it more like a "regime" or "as specified by rfc6144".

Contributor

I think as specified by is perfect!


endif::[]
39 changes: 39 additions & 0 deletions docs/reference/mapping/types/keyword.asciidoc
@@ -173,6 +173,45 @@ Dimension fields have the following constraints:
====
--

ifeval::["{release-state}"=="unreleased"]
[[keyword-synthetic-source]]
==== Synthetic source
`keyword` fields support <<synthetic-source,synthetic `_source`>> in their
default configuration. But they don't support synthetic `_source` if you add
<<ignore-above,`ignore_above`>>, a <<normalizer,`normalizer`>>,
<<copy-to,`copy_to`>>, or if you disable <<doc-values,`doc_values`>>.

Synthetic source will always sort `keyword` fields and remove duplicates so:
[source,console,id=synthetic-source-keyword-example]
----
PUT idx
{
"mappings": {
"_source": { "synthetic": true },
"properties": {
"kwd": { "type": "keyword" }
}
}
}
PUT idx/_doc/1
{
"kwd": ["foo", "foo", "bar", "baz"]
}
----
// TEST[s/$/\nGET idx\/_doc\/1?filter_path=_source\n/]

Will become:

[source,console-result]
----
{
"kwd": ["bar", "baz", "foo"]
}
----
// TEST[s/^/{"_source":/ s/\n$/}/]
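Unlike `boolean`, a `keyword` array loses its duplicates as well as its order. A rough Python equivalent of the visible result is a sorted set. This is a sketch only: `synthetic_keywords` is a hypothetical name, and the sketch assumes plain code point ordering of the strings.

```python
def synthetic_keywords(values):
    """Drop duplicate keywords and return the rest in sorted order."""
    return sorted(set(values))


print(synthetic_keywords(["foo", "foo", "bar", "baz"]))
# ['bar', 'baz', 'foo']
```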

endif::[]

include::constant-keyword.asciidoc[]

include::wildcard.asciidoc[]