Docs for synthetic source (#87416)
This adds some basic docs for synthetic source both to get us started
documenting it and to show how I'd like to get it documented - with a
central section in the docs for `_source` and "satellite" sections in
each of the supported field types that link back to the central section.

[Preview](https://elasticsearch_87416.docs-preview.app.elstc.co/guide/en/elasticsearch/reference/master/mapping-source-field.html#synthetic-source)
nik9000 authored Jun 9, 2022
1 parent 09d7e45 commit b18bafb
Showing 8 changed files with 419 additions and 0 deletions.
11 changes: 11 additions & 0 deletions docs/reference/mapping/fields/source-field.asciidoc
@@ -6,6 +6,17 @@ at index time. The `_source` field itself is not indexed (and thus is not
searchable), but it is stored so that it can be returned when executing
_fetch_ requests, like <<docs-get,get>> or <<search-search,search>>.

ifeval::["{release-state}"=="unreleased"]
If disk usage is important to you, consider
<<synthetic-source,synthetic `_source`>>, which shrinks disk usage at the cost
of supporting only a subset of mappings and slower fetches, or (not recommended)
<<disable-source-field,disabling the `_source` field>>, which also shrinks disk
usage but disables many features.

include::synthetic-source.asciidoc[]
endif::[]


[[disable-source-field]]
==== Disabling the `_source` field

120 changes: 120 additions & 0 deletions docs/reference/mapping/fields/synthetic-source.asciidoc
@@ -0,0 +1,120 @@
[[synthetic-source]]
==== Synthetic `_source`

Though very handy to have around, the source field takes up a significant amount
of space on disk. Instead of storing source documents on disk exactly as you
send them, Elasticsearch can reconstruct source content on the fly upon retrieval.
Enable this by setting `synthetic: true` in `_source`:

[source,console,id=enable-synthetic-source-example]
----
PUT idx
{
"mappings": {
"_source": {
"synthetic": true
}
}
}
----
// TESTSETUP

While this on-the-fly reconstruction is *generally* slower than saving the source
documents verbatim and loading them at query time, it saves a lot of storage
space. There are a couple of restrictions to be aware of:

* When you retrieve synthetic `_source` content it undergoes minor
<<synthetic-source-modifications,modifications>> compared to the original JSON.
* Synthetic `_source` can be used with indices that contain only these field
types:

** <<boolean-synthetic-source,`boolean`>>
** <<numeric-synthetic-source,`byte`>>
** <<numeric-synthetic-source,`double`>>
** <<numeric-synthetic-source,`float`>>
** <<geo-point-synthetic-source,`geo_point`>>
** <<numeric-synthetic-source,`half_float`>>
** <<numeric-synthetic-source,`integer`>>
** <<ip-synthetic-source,`ip`>>
** <<keyword-synthetic-source,`keyword`>>
** <<numeric-synthetic-source,`long`>>
** <<numeric-synthetic-source,`scaled_float`>>
** <<numeric-synthetic-source,`short`>>
** <<text-synthetic-source,`text`>> (with a `keyword` sub-field)

[[synthetic-source-modifications]]
===== Synthetic source modifications

When synthetic `_source` is enabled, retrieved documents undergo some
modifications compared to the original JSON.

[[synthetic-source-modifications-leaf-arrays]]
====== Arrays moved to leaf fields
With synthetic `_source`, arrays of objects are merged so that arrays appear
only at leaf fields. For example:

[source,console,id=synthetic-source-leaf-arrays-example]
----
PUT idx/_doc/1
{
"foo": [
{
"bar": 1
},
{
"bar": 2
}
]
}
----
// TEST[s/$/\nGET idx\/_doc\/1?filter_path=_source\n/]

Will become:

[source,console-result]
----
{
"foo": {
"bar": [1, 2]
}
}
----
// TEST[s/^/{"_source":/ s/\n$/}/]
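
The transformation above can be mimicked outside Elasticsearch. The following is
a minimal Python sketch, not the actual implementation: the function names are
hypothetical, and it assumes that no field holds both scalars and objects and
that a leaf with a single value is returned as a scalar.

```python
def collect_leaves(value, path, out):
    """Gather every scalar under `value`, keyed by its field path."""
    if isinstance(value, dict):
        for key, child in value.items():
            collect_leaves(child, path + (key,), out)
    elif isinstance(value, list):
        # Arrays do not add a path segment; their items share the parent path.
        for item in value:
            collect_leaves(item, path, out)
    else:
        out.setdefault(path, []).append(value)


def move_arrays_to_leaves(doc):
    """Rebuild `doc` (a dict) so that arrays only appear at leaf fields."""
    leaves = {}
    collect_leaves(doc, (), leaves)
    result = {}
    for path, values in leaves.items():
        node = result
        for key in path[:-1]:
            node = node.setdefault(key, {})
        # Assumption of this sketch: a single value comes back as a scalar.
        node[path[-1]] = values[0] if len(values) == 1 else values
    return result


print(move_arrays_to_leaves({"foo": [{"bar": 1}, {"bar": 2}]}))
```

Collecting values by path first keeps the sketch short; it says nothing about
how Elasticsearch loads the values internally.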

[[synthetic-source-modifications-field-names]]
====== Fields named as they are mapped
Synthetic source names fields as they are named in the mapping. When used
with <<dynamic,dynamic mapping>>, fields with dots (`.`) in their names are, by
default, interpreted as multiple objects, while dots in field names are
preserved within objects that have <<subobjects>> disabled. For example:

[source,console,id=synthetic-source-objecty-example]
----
PUT idx/_doc/1
{
"foo.bar.baz": 1
}
----
// TEST[s/$/\nGET idx\/_doc\/1?filter_path=_source\n/]

Will become:

[source,console-result]
----
{
"foo": {
"bar": {
"baz": 1
}
}
}
----
// TEST[s/^/{"_source":/ s/\n$/}/]
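
As a rough illustration of the default behavior (subobjects enabled), the
expansion of dotted names can be sketched in Python. `expand_dotted_names` is a
hypothetical helper, and the sketch assumes a field is never supplied in both
dotted and nested form within the same document.

```python
def expand_dotted_names(doc):
    """Turn keys such as "foo.bar.baz" into nested objects."""
    result = {}
    for name, value in doc.items():
        if isinstance(value, dict):
            value = expand_dotted_names(value)
        *parents, leaf = name.split(".")
        node = result
        for key in parents:
            node = node.setdefault(key, {})
        node[leaf] = value
    return result


print(expand_dotted_names({"foo.bar.baz": 1}))
```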

[[synthetic-source-modifications-alphabetical]]
====== Alphabetical sorting
Synthetic `_source` fields are sorted alphabetically. The
https://www.rfc-editor.org/rfc/rfc7159.html[JSON RFC] defines objects as
"an unordered collection of zero or more name/value pairs", so applications
shouldn't care. Without synthetic `_source`, however, the original ordering is
preserved, and some applications may, counter to the spec, rely on that
ordering.
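
A client that wants to check whether it depends on key order can normalize both
forms before comparing them. This Python sketch uses the standard `json` module;
`sort_keys=True` orders keys by code point, which is only assumed here to match
the alphabetical order Elasticsearch produces for typical ASCII field names.

```python
import json


def normalize(text):
    """Re-serialize a JSON document with its keys sorted at every level."""
    return json.dumps(json.loads(text), sort_keys=True)


original = '{"b": 1, "a": {"d": 2, "c": 3}}'
print(normalize(original))
```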
36 changes: 36 additions & 0 deletions docs/reference/mapping/types/boolean.asciidoc
@@ -214,3 +214,39 @@ The following parameters are accepted by `boolean` fields:
<<mapping-field-meta,`meta`>>::

Metadata about the field.

ifeval::["{release-state}"=="unreleased"]
[[boolean-synthetic-source]]
==== Synthetic source
`boolean` fields support <<synthetic-source,synthetic `_source`>> in their
default configuration. Synthetic `_source` cannot be used together with
<<copy-to,`copy_to`>> or with <<doc-values,`doc_values`>> disabled.

Synthetic source always sorts `boolean` fields. For example:
[source,console,id=synthetic-source-boolean-example]
----
PUT idx
{
"mappings": {
"_source": { "synthetic": true },
"properties": {
"bool": { "type": "boolean" }
}
}
}
PUT idx/_doc/1
{
"bool": [true, false, true, false]
}
----
// TEST[s/$/\nGET idx\/_doc\/1?filter_path=_source\n/]

Will become:
[source,console-result]
----
{
"bool": [false, false, true, true]
}
----
// TEST[s/^/{"_source":/ s/\n$/}/]
endif::[]
44 changes: 44 additions & 0 deletions docs/reference/mapping/types/geo-point.asciidoc
@@ -203,3 +203,47 @@ For performance reasons, it is better to access the lat/lon values directly:
def lat = doc['location'].lat;
def lon = doc['location'].lon;
--------------------------------------------------

ifeval::["{release-state}"=="unreleased"]
[[geo-point-synthetic-source]]
==== Synthetic source
`geo_point` fields support <<synthetic-source,synthetic `_source`>> in their
default configuration. Synthetic `_source` cannot be used together with
<<ignore-malformed,`ignore_malformed`>>, <<copy-to,`copy_to`>>, or with
<<doc-values,`doc_values`>> disabled.

Synthetic source always sorts `geo_point` fields (first by latitude and then
longitude) and reduces them to their stored precision. For example:
[source,console,id=synthetic-source-geo-point-example]
----
PUT idx
{
"mappings": {
"_source": { "synthetic": true },
"properties": {
"point": { "type": "geo_point" }
}
}
}
PUT idx/_doc/1
{
"point": [
{"lat":-90, "lon":-80},
{"lat":10, "lon":30}
]
}
----
// TEST[s/$/\nGET idx\/_doc\/1?filter_path=_source\n/]

Will become:
[source,console-result]
----
{
"point": [
{"lat":-90.0, "lon":-80.00000000931323},
{"lat":9.999999990686774, "lon":29.999999972060323}
]
}
----
// TEST[s/^/{"_source":/ s/\n$/}/]
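
The returned values in this example are consistent with each coordinate being
rounded down onto a 32-bit grid (180 degrees of latitude or 360 degrees of
longitude split into 2^32 steps). The Python sketch below reproduces the
example under that assumption; `quantize` and `synthetic_geo_points` are
hypothetical names, not part of any Elasticsearch API.

```python
import math


def quantize(value, full_range):
    """Round a coordinate down to a 32-bit grid spanning `full_range` degrees."""
    encoded = math.floor(value * (1 << 32) / full_range)
    # Keep the upper bound (+90 or +180) inside a signed 32-bit integer.
    encoded = min(encoded, (1 << 31) - 1)
    return encoded * full_range / (1 << 32)


def synthetic_geo_points(points):
    """Quantize each point, then sort by latitude and then longitude."""
    stored = [(quantize(p["lat"], 180.0), quantize(p["lon"], 360.0)) for p in points]
    return [{"lat": lat, "lon": lon} for lat, lon in sorted(stored)]


for point in synthetic_geo_points([{"lat": 10, "lon": 30}, {"lat": -90, "lon": -80}]):
    print(point)
```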
endif::[]
44 changes: 44 additions & 0 deletions docs/reference/mapping/types/ip.asciidoc
@@ -156,3 +156,47 @@ GET my-index-000001/_search
}
}
--------------------------------------------------

ifeval::["{release-state}"=="unreleased"]
[[ip-synthetic-source]]
==== Synthetic source
`ip` fields support <<synthetic-source,synthetic `_source`>> in their default
configuration. Synthetic `_source` cannot be used together with
<<ignore-malformed,`ignore_malformed`>>, <<copy-to,`copy_to`>>, or with
<<doc-values,`doc_values`>> disabled.

Synthetic source always sorts `ip` fields and removes duplicates. For example:
[source,console,id=synthetic-source-ip-example]
----
PUT idx
{
"mappings": {
"_source": { "synthetic": true },
"properties": {
"ip": { "type": "ip" }
}
}
}
PUT idx/_doc/1
{
"ip": ["192.168.0.1", "192.168.0.1", "10.10.12.123",
"2001:db8::1:0:0:1", "::afff:4567:890a"]
}
----
// TEST[s/$/\nGET idx\/_doc\/1?filter_path=_source\n/]

Will become:

[source,console-result]
----
{
"ip": ["::afff:4567:890a", "10.10.12.123", "192.168.0.1", "2001:db8::1:0:0:1"]
}
----
// TEST[s/^/{"_source":/ s/\n$/}/]

NOTE: IPv4 addresses are sorted as though they were IPv6 addresses prefixed by
`::ffff:0:0:0/96` as specified by
https://datatracker.ietf.org/doc/html/rfc6144[rfc6144].
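
The ordering described in the note can be approximated with Python's standard
`ipaddress` module. This is a sketch only: `synthetic_ips` is a hypothetical
helper, the prefix is copied from the note above, and the original strings are
returned rather than a canonical rendering.

```python
import ipaddress

# Prefix taken from the note above; IPv4 addresses fill its low 32 bits.
PREFIX = ipaddress.IPv6Network("::ffff:0:0:0/96")


def as_ipv6(text):
    """Parse an address and place IPv4 addresses inside PREFIX."""
    address = ipaddress.ip_address(text)
    if address.version == 4:
        return ipaddress.IPv6Address(int(PREFIX.network_address) + int(address))
    return address


def synthetic_ips(values):
    """Sort addresses in IPv6 order and drop duplicates."""
    unique = {as_ipv6(value): value for value in values}
    return [unique[key] for key in sorted(unique)]


print(synthetic_ips(["192.168.0.1", "192.168.0.1", "10.10.12.123",
                     "2001:db8::1:0:0:1", "::afff:4567:890a"]))
```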

endif::[]
40 changes: 40 additions & 0 deletions docs/reference/mapping/types/keyword.asciidoc
@@ -173,6 +173,46 @@ Dimension fields have the following constraints:
====
--

ifeval::["{release-state}"=="unreleased"]
[[keyword-synthetic-source]]
==== Synthetic source
`keyword` fields support <<synthetic-source,synthetic `_source`>> in their
default configuration. Synthetic `_source` cannot be used together with
<<ignore-above,`ignore_above`>>, a <<normalizer,`normalizer`>>,
<<copy-to,`copy_to`>>, or with <<doc-values,`doc_values`>> disabled.

Synthetic source always sorts `keyword` fields and removes duplicates. For
example:
[source,console,id=synthetic-source-keyword-example]
----
PUT idx
{
"mappings": {
"_source": { "synthetic": true },
"properties": {
"kwd": { "type": "keyword" }
}
}
}
PUT idx/_doc/1
{
"kwd": ["foo", "foo", "bar", "baz"]
}
----
// TEST[s/$/\nGET idx\/_doc\/1?filter_path=_source\n/]

Will become:

[source,console-result]
----
{
"kwd": ["bar", "baz", "foo"]
}
----
// TEST[s/^/{"_source":/ s/\n$/}/]
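
Client code that compares a fetched document with what it indexed may need to
apply the same normalization first. A minimal Python sketch, assuming plain
code-point ordering is close enough to the order Elasticsearch uses
(`synthetic_keywords` is a hypothetical name):

```python
def synthetic_keywords(values):
    """Sort keyword values and drop duplicates."""
    return sorted(set(values))


print(synthetic_keywords(["foo", "foo", "bar", "baz"]))
```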

endif::[]

include::constant-keyword.asciidoc[]

include::wildcard.asciidoc[]
67 changes: 67 additions & 0 deletions docs/reference/mapping/types/numeric.asciidoc
@@ -233,3 +233,70 @@ numeric field can't be both a time series dimension and a time series metric.
sorting) will behave as if the document had a value of +2.3+. High values
of `scaling_factor` improve accuracy but also increase space requirements.
This parameter is required.

ifeval::["{release-state}"=="unreleased"]
[[numeric-synthetic-source]]
==== Synthetic source
All numeric fields except `unsigned_long` support <<synthetic-source,synthetic
`_source`>> in their default configuration. Synthetic `_source` cannot be used
together with <<ignore-malformed,`ignore_malformed`>>, <<copy-to,`copy_to`>>, or
with <<doc-values,`doc_values`>> disabled.

Synthetic source always sorts numeric fields. For example:
[source,console,id=synthetic-source-numeric-example]
----
PUT idx
{
"mappings": {
"_source": { "synthetic": true },
"properties": {
"long": { "type": "long" }
}
}
}
PUT idx/_doc/1
{
"long": [0, 0, -123466, 87612]
}
----
// TEST[s/$/\nGET idx\/_doc\/1?filter_path=_source\n/]

Will become:
[source,console-result]
----
{
"long": [-123466, 0, 0, 87612]
}
----
// TEST[s/^/{"_source":/ s/\n$/}/]

Scaled floats always apply their scaling factor, so:
[source,console,id=synthetic-source-scaled-float-example]
----
PUT idx
{
"mappings": {
"_source": { "synthetic": true },
"properties": {
"f": { "type": "scaled_float", "scaling_factor": 0.01 }
}
}
}
PUT idx/_doc/1
{
"f": 123
}
----
// TEST[s/$/\nGET idx\/_doc\/1?filter_path=_source\n/]

Will become:

[source,console-result]
----
{
"f": 100.0
}
----
// TEST[s/^/{"_source":/ s/\n$/}/]
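
The arithmetic behind this result can be sketched in Python, assuming the value
is multiplied by the scaling factor, rounded to a whole number for storage, and
divided by the scaling factor again on retrieval. `scaled_float_round_trip` is a
hypothetical name, and Python's `round` breaks ties differently from Java, which
does not matter for this input.

```python
def scaled_float_round_trip(value, scaling_factor):
    """Return the value a scaled_float would give back after storage."""
    stored = round(value * scaling_factor)  # 123 * 0.01 = 1.23, stored as 1
    return stored / scaling_factor          # 1 / 0.01 = 100.0


print(scaled_float_round_trip(123, 0.01))
```

With a scaling factor below 1, the precision lost can be large, which is why
`123` comes back as `100.0` here.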

endif::[]