diff --git a/docs/reference/mapping/fields/source-field.asciidoc b/docs/reference/mapping/fields/source-field.asciidoc index 0720a7758b046..76d98303dce82 100644 --- a/docs/reference/mapping/fields/source-field.asciidoc +++ b/docs/reference/mapping/fields/source-field.asciidoc @@ -6,6 +6,17 @@ at index time. The `_source` field itself is not indexed (and thus is not searchable), but it is stored so that it can be returned when executing _fetch_ requests, like <> or <>. +ifeval::["{release-state}"=="unreleased"] +If disk usage is important to you then have a look at +<> which shrinks disk usage at the cost of +only supporting a subset of mappings and slower fetches or (not recommended) +<> which also shrinks disk +usage but disables many features. + +include::synthetic-source.asciidoc[] +endif::[] + + [[disable-source-field]] ==== Disabling the `_source` field diff --git a/docs/reference/mapping/fields/synthetic-source.asciidoc b/docs/reference/mapping/fields/synthetic-source.asciidoc new file mode 100644 index 0000000000000..204d65a026a69 --- /dev/null +++ b/docs/reference/mapping/fields/synthetic-source.asciidoc @@ -0,0 +1,120 @@ +[[synthetic-source]] +==== Synthetic `_source` + +Though very handy to have around, the source field takes up a significant amount +of space on disk. Instead of storing source documents on disk exactly as you +send them, Elasticsearch can reconstruct source content on the fly upon retrieval. +Enable this by setting `synthetic: true` in `_source`: + +[source,console,id=enable-synthetic-source-example] +---- +PUT idx +{ + "mappings": { + "_source": { + "synthetic": true + } + } +} +---- +// TESTSETUP + +While this on the fly reconstruction is *generally* slower than saving the source +documents verbatim and loading them at query time, it saves a lot of storage +space. There are a couple of restrictions to be aware of: + +* When you retrieve synthetic `_source` content it undergoes minor +<> compared to the original JSON. 
+* Synthetic `_source` can be used with indices that contain only these field +types: + +** <> +** <> +** <> +** <> +** <> +** <> +** <> +** <> +** <> +** <> +** <> +** <> +** <> (with a `keyword` sub-field) + +[[synthetic-source-modifications]] +===== Synthetic source modifications + +When synthetic `_source` is enabled, retrieved documents undergo some +modifications compared to the original JSON. + +[[synthetic-source-modifications-leaf-arrays]] +====== Arrays moved to leaf fields +Synthetic `_source` arrays are moved to leaves. For example: + +[source,console,id=synthetic-source-leaf-arrays-example] +---- +PUT idx/_doc/1 +{ + "foo": [ + { + "bar": 1 + }, + { + "bar": 2 + } + ] +} +---- +// TEST[s/$/\nGET idx\/_doc\/1?filter_path=_source\n/] + +Will become: + +[source,console-result] +---- +{ + "foo": { + "bar": [1, 2] + } +} +---- +// TEST[s/^/{"_source":/ s/\n$/}/] + +[[synthetic-source-modifications-field-names]] +====== Fields named as they are mapped +Synthetic source names fields as they are named in the mapping. When used +with <>, fields with dots (`.`) in their names are, by +default, interpreted as multiple objects, while dots in field names are +preserved within objects that have <> disabled. For example: + +[source,console,id=synthetic-source-objecty-example] +---- +PUT idx/_doc/1 +{ + "foo.bar.baz": 1 +} +---- +// TEST[s/$/\nGET idx\/_doc\/1?filter_path=_source\n/] + +Will become: + +[source,console-result] +---- +{ + "foo": { + "bar": { + "baz": 1 + } + } +} +---- +// TEST[s/^/{"_source":/ s/\n$/}/] + +[[synthetic-source-modifications-alphabetical]] +====== Alphabetical sorting +Synthetic `_source` fields are sorted alphabetically. The +https://www.rfc-editor.org/rfc/rfc7159.html[JSON RFC] defines objects as +"an unordered collection of zero or more name/value pairs" so applications +shouldn't care but without synthetic `_source` the original ordering is +preserved and some applications may, counter to the spec, do something with +that ordering. 
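The three modifications described in `synthetic-source.asciidoc` above (arrays moved to leaves, dotted names expanded into objects, alphabetical key order) can be approximated outside Elasticsearch. The following is a minimal Python sketch for reasoning about the expected output, not the Elasticsearch implementation. The function name `synthesize` is hypothetical, the sketch assumes every object has `subobjects` enabled, and it does not apply the per-field-type sorting or de-duplication described in the field type pages.

```python
def synthesize(source):
    """Approximate the documented synthetic _source rewrites on a parsed JSON doc.

    Hypothetical helper for illustration only; not an Elasticsearch API.
    """
    leaves = {}  # full dotted path -> values in the order they were seen

    def collect(prefix, value):
        if isinstance(value, dict):
            for key, child in value.items():
                collect(f"{prefix}.{key}" if prefix else key, child)
        elif isinstance(value, list):
            # Arrays of objects are flattened: each leaf keeps its own value list.
            for item in value:
                collect(prefix, item)
        else:
            leaves.setdefault(prefix, []).append(value)

    collect("", source)

    # Rebuild nested objects, treating every dot as an object boundary.
    result = {}
    for path, values in leaves.items():
        *parents, leaf = path.split(".")
        node = result
        for part in parents:
            node = node.setdefault(part, {})
        # Sketch simplification: a lone value is emitted as a scalar.
        node[leaf] = values[0] if len(values) == 1 else values

    def sort_keys(node):
        # Emit keys alphabetically at every nesting level.
        if isinstance(node, dict):
            return {key: sort_keys(node[key]) for key in sorted(node)}
        return node

    return sort_keys(result)


print(synthesize({"foo": [{"bar": 1}, {"bar": 2}]}))
print(synthesize({"foo.bar.baz": 1}))
```

The two calls mirror the `synthetic-source-leaf-arrays-example` and `synthetic-source-objecty-example` snippets in the diff, so the printed dictionaries can be compared against the documented `console-result` blocks.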
diff --git a/docs/reference/mapping/types/boolean.asciidoc b/docs/reference/mapping/types/boolean.asciidoc index 81055a0d2df5f..a549dc01c1c8a 100644 --- a/docs/reference/mapping/types/boolean.asciidoc +++ b/docs/reference/mapping/types/boolean.asciidoc @@ -214,3 +214,39 @@ The following parameters are accepted by `boolean` fields: <>:: Metadata about the field. + +ifeval::["{release-state}"=="unreleased"] +[[boolean-synthetic-source]] +==== Synthetic source +`boolean` fields support <> in their +default configuration. Synthetic `_source` cannot be used together with +<> or with <> disabled. + +Synthetic source always sorts `boolean` fields. For example: +[source,console,id=synthetic-source-boolean-example] +---- +PUT idx +{ + "mappings": { + "_source": { "synthetic": true }, + "properties": { + "bool": { "type": "boolean" } + } + } +} +PUT idx/_doc/1 +{ + "bool": [true, false, true, false] +} +---- +// TEST[s/$/\nGET idx\/_doc\/1?filter_path=_source\n/] + +Will become: +[source,console-result] +---- +{ + "bool": [false, false, true, true] +} +---- +// TEST[s/^/{"_source":/ s/\n$/}/] +endif::[] diff --git a/docs/reference/mapping/types/geo-point.asciidoc b/docs/reference/mapping/types/geo-point.asciidoc index 31effb539de9c..8883d95645ae6 100644 --- a/docs/reference/mapping/types/geo-point.asciidoc +++ b/docs/reference/mapping/types/geo-point.asciidoc @@ -203,3 +203,47 @@ For performance reasons, it is better to access the lat/lon values directly: def lat = doc['location'].lat; def lon = doc['location'].lon; -------------------------------------------------- + +ifeval::["{release-state}"=="unreleased"] +[[geo-point-synthetic-source]] +==== Synthetic source +`geo_point` fields support <> in their +default configuration. Synthetic `_source` cannot be used together with +<>, <>, or with +<> disabled. + +Synthetic source always sorts `geo_point` fields (first by latitude and then +longitude) and reduces them to their stored precision. 
For example: +[source,console,id=synthetic-source-geo-point-example] +---- +PUT idx +{ + "mappings": { + "_source": { "synthetic": true }, + "properties": { + "point": { "type": "geo_point" } + } + } +} +PUT idx/_doc/1 +{ + "point": [ + {"lat":-90, "lon":-80}, + {"lat":10, "lon":30} + ] +} +---- +// TEST[s/$/\nGET idx\/_doc\/1?filter_path=_source\n/] + +Will become: +[source,console-result] +---- +{ + "point": [ + {"lat":-90.0, "lon":-80.00000000931323}, + {"lat":9.999999990686774, "lon":29.999999972060323} + ] +} +---- +// TEST[s/^/{"_source":/ s/\n$/}/] +endif::[] diff --git a/docs/reference/mapping/types/ip.asciidoc b/docs/reference/mapping/types/ip.asciidoc index 2e598e40bbacc..5b1f249af6bbf 100644 --- a/docs/reference/mapping/types/ip.asciidoc +++ b/docs/reference/mapping/types/ip.asciidoc @@ -156,3 +156,47 @@ GET my-index-000001/_search } } -------------------------------------------------- + +ifeval::["{release-state}"=="unreleased"] +[[ip-synthetic-source]] +==== Synthetic source +`ip` fields support <> in their default +configuration. Synthetic `_source` cannot be used together with +<>, <>, or with +<> disabled. + +Synthetic source always sorts `ip` fields and removes duplicates. For example: +[source,console,id=synthetic-source-ip-example] +---- +PUT idx +{ + "mappings": { + "_source": { "synthetic": true }, + "properties": { + "ip": { "type": "ip" } + } + } +} +PUT idx/_doc/1 +{ + "ip": ["192.168.0.1", "192.168.0.1", "10.10.12.123", + "2001:db8::1:0:0:1", "::afff:4567:890a"] +} +---- +// TEST[s/$/\nGET idx\/_doc\/1?filter_path=_source\n/] + +Will become: + +[source,console-result] +---- +{ + "ip": ["::afff:4567:890a", "10.10.12.123", "192.168.0.1", "2001:db8::1:0:0:1"] +} +---- +// TEST[s/^/{"_source":/ s/\n$/}/] + +NOTE: IPv4 addresses are sorted as though they were IPv6 addresses prefixed by + `::ffff:0:0:0/96` as specified by + https://datatracker.ietf.org/doc/html/rfc6144[rfc6144]. 
+ +endif::[] diff --git a/docs/reference/mapping/types/keyword.asciidoc b/docs/reference/mapping/types/keyword.asciidoc index c73e77aab94a2..03fb47ff81ca3 100644 --- a/docs/reference/mapping/types/keyword.asciidoc +++ b/docs/reference/mapping/types/keyword.asciidoc @@ -173,6 +173,46 @@ Dimension fields have the following constraints: ==== -- +ifeval::["{release-state}"=="unreleased"] +[[keyword-synthetic-source]] +==== Synthetic source +`keyword` fields support <> in their +default configuration. Synthetic `_source` cannot be used together with +<>, a <>, +<>, or with <> disabled. + +Synthetic source always sorts `keyword` fields and removes duplicates. For +example: +[source,console,id=synthetic-source-keyword-example] +---- +PUT idx +{ + "mappings": { + "_source": { "synthetic": true }, + "properties": { + "kwd": { "type": "keyword" } + } + } +} +PUT idx/_doc/1 +{ + "kwd": ["foo", "foo", "bar", "baz"] +} +---- +// TEST[s/$/\nGET idx\/_doc\/1?filter_path=_source\n/] + +Will become: + +[source,console-result] +---- +{ + "kwd": ["bar", "baz", "foo"] +} +---- +// TEST[s/^/{"_source":/ s/\n$/}/] + +endif::[] + include::constant-keyword.asciidoc[] include::wildcard.asciidoc[] diff --git a/docs/reference/mapping/types/numeric.asciidoc b/docs/reference/mapping/types/numeric.asciidoc index f33bbb99201c0..ae46a93f53007 100644 --- a/docs/reference/mapping/types/numeric.asciidoc +++ b/docs/reference/mapping/types/numeric.asciidoc @@ -233,3 +233,70 @@ numeric field can't be both a time series dimension and a time series metric. sorting) will behave as if the document had a value of +2.3+. High values of `scaling_factor` improve accuracy but also increase space requirements. This parameter is required. + +ifeval::["{release-state}"=="unreleased"] +[[numeric-synthetic-source]] +==== Synthetic source +All numeric fields except `unsigned_long` support <> in their default configuration. Synthetic `_source` cannot be used +together with <>, <>, or +with <> disabled. 
+ +Synthetic source always sorts numeric fields. For example: +[source,console,id=synthetic-source-numeric-example] +---- +PUT idx +{ + "mappings": { + "_source": { "synthetic": true }, + "properties": { + "long": { "type": "long" } + } + } +} +PUT idx/_doc/1 +{ + "long": [0, 0, -123466, 87612] +} +---- +// TEST[s/$/\nGET idx\/_doc\/1?filter_path=_source\n/] + +Will become: +[source,console-result] +---- +{ + "long": [-123466, 0, 0, 87612] +} +---- +// TEST[s/^/{"_source":/ s/\n$/}/] + +Scaled floats will always apply their scaling factor so: +[source,console,id=synthetic-source-scaled-float-example] +---- +PUT idx +{ + "mappings": { + "_source": { "synthetic": true }, + "properties": { + "f": { "type": "scaled_float", "scaling_factor": 0.01 } + } + } +} +PUT idx/_doc/1 +{ + "f": 123 +} +---- +// TEST[s/$/\nGET idx\/_doc\/1?filter_path=_source\n/] + +Will become: + +[source,console-result] +---- +{ + "f": 100.0 +} +---- +// TEST[s/^/{"_source":/ s/\n$/}/] + +endif::[] diff --git a/docs/reference/mapping/types/text.asciidoc b/docs/reference/mapping/types/text.asciidoc index eb386e37a81f7..87b0ab5009cf4 100644 --- a/docs/reference/mapping/types/text.asciidoc +++ b/docs/reference/mapping/types/text.asciidoc @@ -159,6 +159,63 @@ The following parameters are accepted by `text` fields: Metadata about the field. +ifeval::["{release-state}"=="unreleased"] +[[text-synthetic-source]] +==== Synthetic source +`text` fields support <> if they have +a `keyword` sub-field that supports synthetic `_source` and *do not* have +<>. + +Synthetic source always sorts `keyword` fields and removes duplicates, so +`text` fields are sorted based on the sub-`keyword` field. 
For example: +[source,console,id=synthetic-source-text-example] +---- +PUT idx +{ + "mappings": { + "_source": { "synthetic": true }, + "properties": { + "text": { + "type": "text", + "fields": { + "raw": { + "type": "keyword" + } + } + } + } + } +} +PUT idx/_doc/1 +{ + "text": [ + "the quick brown fox", + "the quick brown fox", + "jumped over the lazy dog" + ] +} +---- +// TEST[s/$/\nGET idx\/_doc\/1?filter_path=_source\n/] + +Will become: +[source,console-result] +---- +{ + "text": [ + "jumped over the lazy dog", + "the quick brown fox" + ] +} +---- +// TEST[s/^/{"_source":/ s/\n$/}/] + +NOTE: Reordering text fields can have an effect on <> + and <> queries. See the discussion about + <> for more detail. You + can avoid this by making sure the `slop` parameter on the phrase queries + is lower than the `position_increment_gap`. This is the default. +endif::[] + [[fielddata-mapping-param]] ==== `fielddata` mapping parameter