Switch doc_values=true to be the default for non-analyzed string fields #8312

gibrown · 2014-10-31T20:53:04Z

We recently started doing a lot of sorting by date and integer fields on subsets of our 5 billion documents. We are mostly running queries that filter to a few thousand docs and then sort that subset. Needing to load 5 billion dates into memory for this seemingly common use case makes ES feel broken. :)

Heap memory management can be very painful and I think represents a difficult point for new users. Search engines are all about preprocessing data (indexing) so that querying can be fast. In my opinion doc_values accomplishes this better than the current default even if it can be slower in some cases.

FWIW, doc_values is MUCH faster in our use case using 1.3.4.

bobrik · 2014-12-04T13:30:11Z

@gibrown Have you seen increased segment memory usage after switching to doc_values? I've seen a 2-8x increase (from 32 MB to 70-270 MB) per index with 120M events.

clintongormley · 2014-12-31T17:18:49Z

The JVM heap size is practically limited to 32GB. Above this, the JVM can no longer use compressed pointers (resulting in more memory usage for the same data), and garbage collections become slower.

By far the biggest consumer of heap for most users is in-memory fielddata, used for sorting, aggregations and scripts. In-memory fielddata is slow to load, as it has to read the whole inverted index and uninvert it. If the fielddata cache fills up, old data is evicted, causing heap churn and bad performance (as fielddata is reloaded and evicted again).
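To make the cost of uninverting concrete, here is a minimal sketch in plain Python. It is a toy model and not how Lucene or Elasticsearch implement fielddata; the function names and the sample dates are assumptions made up for illustration. It shows that sorting a small filtered subset still requires a per-document column, and that building the column from an inverted index means walking every postings list.

```python
from collections import defaultdict


def build_inverted_index(values):
    """Map each term to the list of doc ids that contain it (toy inverted index)."""
    inverted = defaultdict(list)
    for doc_id, value in enumerate(values):
        inverted[value].append(doc_id)
    return dict(inverted)


def uninvert(inverted, num_docs):
    """Rebuild a doc id -> value column by visiting every posting of every term."""
    column = [None] * num_docs
    for term, postings in inverted.items():
        for doc_id in postings:
            column[doc_id] = term
    return column


# Hypothetical date values, one per document.
values = ["2014-10-31", "2014-12-04", "2014-10-31", "2015-03-27"]
inverted = build_inverted_index(values)

# The whole index is read even though the query below matches only three docs.
column = uninvert(inverted, len(values))

matching_doc_ids = [3, 0, 1]  # result of some filter
sorted_subset = sorted(matching_doc_ids, key=lambda doc_id: column[doc_id])
```

In this model, doc values correspond to writing `column` out at index time instead of rebuilding it on the heap at query time.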

Doc values provide the same function as in-memory fielddata, but the data structure is written to disk at index time. This results in more disk usage and somewhat slower indexing and merging (because there is more I/O). Aggregations and sorting are about 20% slower than they are with in-memory fielddata.

The advantages are:

- less heap usage and faster garbage collections
- no longer limited by the amount of fielddata that can fit into 32GB of heap; instead, the file system caches can make use of all the available RAM
- fewer latency spikes caused by reloading a large segment into memory

Unfortunately, there is no way to back-fill these values without reindexing. The proposal is to make doc values the default for all fields except analyzed string fields (which are not supported by doc values anyway).

Some users will end up writing much more data than they need, but to be clear: this is a default setting which can be changed as appropriate. It is similar to the fact that we index term positions by default on all analyzed string fields, so as to make phrase queries work out of the box.
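As a sketch of what changing the setting per field could look like, the snippet below builds a mapping body as a Python dict and serializes it to JSON. The type name `event` and the field names are hypothetical; the `string` / `not_analyzed` / `doc_values` syntax follows the 1.x-era mapping style used elsewhere in this thread. No request is sent, so how the body is submitted is left open.

```python
import json

# Hypothetical mapping: type name "event" and all field names are made up.
mapping = {
    "mappings": {
        "event": {
            "properties": {
                # Sorted and aggregated on: keep the column on disk.
                "status": {
                    "type": "string",
                    "index": "not_analyzed",
                    "doc_values": True,
                },
                "timestamp": {"type": "date", "doc_values": True},
                # Only ever looked up by exact match: opt out to save disk.
                "payload_id": {
                    "type": "string",
                    "index": "not_analyzed",
                    "doc_values": False,
                },
            }
        }
    }
}

body = json.dumps(mapping, indent=2)
```

Since doc values cannot be back-filled, a choice like this has to be made before the data is indexed.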

Before making this decision, we need to understand the full impact of doc-values-by-default, by running the following tests:

- SSD vs spinning disk
- merge throttling enabled/disabled
- non-string fields vs all not_analyzed fields
- metric, logging, and wikipedia data sets
- aggs
- sorting asc and desc
- multiple aggs and sorting combined

And measuring the following:

- node stats after indexing (and always starting from zero)
- number of segments before optimize
- index size before/after optimize
- indexing rate (docs/s)
- query rate (docs/s)
- query rate during heavy indexing
- query rate during heavy indexing with eager loading
- query rate during heavy indexing with eager global ordinals
- logs for GC messages and index throttling messages

We need to do this at large scale, with a billion documents, on nodes that are properly tuned (e.g. ES_HEAP_SIZE, mlockall, etc.).

Two further proposals:

- Turn off fielddata loading for analyzed string fields by default. It is too easy to blow up memory usage by sorting or aggregating on an analyzed string field, and it is seldom what the user intended. If the user really does want to run aggregations on an analyzed string field, then it is an easy option to enable on an existing index.
- Default not_analyzed string fields to have ignore_above set to (e.g.) 255 characters, as it seldom makes sense to index or write doc values for such large terms.
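A rough sketch of what these two proposed defaults might look like as field settings follows. The helper names are hypothetical, the 255 limit is the example value from the proposal, and the `"fielddata": {"format": "disabled"}` form is the 1.x-style way of switching fielddata off, used here as an assumption about how the default would be expressed. The length check counts Python characters, which only approximates how a real index would measure a term.

```python
def proposed_string_defaults(analyzed):
    """Return the field settings the two proposals would imply (illustrative only)."""
    if analyzed:
        # Proposal 1: analyzed strings do not load fielddata unless re-enabled.
        return {
            "type": "string",
            "index": "analyzed",
            "fielddata": {"format": "disabled"},
        }
    # Proposal 2: not_analyzed strings get doc values and skip very long terms.
    return {
        "type": "string",
        "index": "not_analyzed",
        "doc_values": True,
        "ignore_above": 255,
    }


def is_kept(value, ignore_above=255):
    """True if a term is short enough to be indexed and written to doc values."""
    return len(value) <= ignore_above
```

For example, `is_kept("a" * 256)` is false under the 255-character limit, so such a term would be left out of both the index and the doc values for that field.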

bharvidixit · 2015-03-13T10:20:53Z

@clintongormley What if I enable doc_values=true for the metadata _id field? Will it have some effect?

    "_id": {
        "store": true,
        "index": "not_analyzed",
        "doc_values": true,
        "path": "id"
    }

jpountz · 2015-03-16T03:24:27Z

@bharvidixit Actually, we are also thinking about enabling doc values on _uid. This way, random sorting (which essentially combines a seed with a hash of the _uid to be reproducible) would not need to load fielddata on the _uid field (which always takes a lot of memory, since this field is unique by definition). It could also help provide consistent pagination by tie-breaking on the _uid instead of the internal Lucene doc ids (since they are not the same on all copies of a shard).
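The two ideas in this comment can be sketched in plain Python. This is not Elasticsearch's implementation: SHA-256 is only a stand-in for whatever hash is actually used, and the function names and sample `_uid` values are hypothetical. The point is that a score derived from a seed and the `_uid` does not depend on internal doc ids, so every copy of a shard produces the same order.

```python
import hashlib


def random_score(seed, uid):
    """Derive a reproducible pseudo-random score from a seed and a document uid."""
    digest = hashlib.sha256(f"{seed}:{uid}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def random_order(uids, seed):
    """Order documents 'randomly', but identically for the same seed."""
    return sorted(uids, key=lambda uid: (random_score(seed, uid), uid))


def sort_with_tiebreak(hits):
    """Sort (sort_value, uid) pairs, breaking ties on uid rather than on doc id."""
    return sorted(hits, key=lambda hit: (hit[0], hit[1]))


# Hypothetical uids; the input order stands in for differing internal doc ids.
uids = ["event#1", "event#2", "event#3", "event#4"]
same_order = random_order(uids, seed=42) == random_order(list(reversed(uids)), seed=42)
```

Both functions read the `_uid` of every candidate document, which is why a cheap per-document lookup of that field matters.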

Doc values significantly reduce heap usage, which results in faster GCs. This change makes the default for doc values dynamic: any field that is indexed but not analyzed now has doc values. This only affects fields on indexes created with 2.0+. Closes elastic#8312. Closes elastic#10209.
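The rule stated in this change description can be written as a small predicate. This is a simplified model and not the actual mapper code: the function name is hypothetical, and it assumes that string fields default to `analyzed` while other field types are indexed without analysis unless `index` is set to `no`.

```python
def has_doc_values_by_default(field_type, index=None):
    """Model of the rule: indexed but not analyzed fields get doc values."""
    if index is None:
        # Assumed defaults: strings are analyzed, everything else is not.
        index = "analyzed" if field_type == "string" else "not_analyzed"
    return index == "not_analyzed"
```

Under this model a `date` field and a `not_analyzed` string get doc values, while an analyzed string or a field with `index` set to `no` does not. An explicit `doc_values` setting in the mapping would still take precedence over the default.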

gibrown · 2015-03-27T20:10:54Z

Awesome, Thanks!

bobrik · 2015-03-27T20:15:32Z

👍

jknewman3 · 2015-04-01T23:12:46Z

Is it possible to set eager loading, or something like that, for doc_values?
It would be helpful for us to always have the most recently added data in cache.
And, yeah, warmers could do it for us. Just wondering if eager applies only to JVM.

rjernst · 2015-04-02T19:50:28Z

@jknewman3 See discussions on #8693

jknewman3 · 2015-04-02T21:47:29Z

Ah, that helps a lot. Thanks!

bolee · 2015-08-06T09:53:32Z

How can I set doc_values=true for every field via the API?

clintongormley · 2015-08-06T10:00:01Z

@bolee please ask questions like that in the forum: https://discuss.elastic.co/

clintongormley added v2.0.0-beta1 >enhancement help wanted adoptme labels Nov 1, 2014

clintongormley added discuss and :Search Foundations/Mapping labels and removed help wanted and adoptme labels Dec 31, 2014

clintongormley mentioned this issue Dec 31, 2014: Terms agg with zero-result filtered query searches whole index #8406 (Closed)

clintongormley mentioned this issue Mar 3, 2015: Roadmap for 2.0 #9970 (Closed)

rjernst mentioned this issue Mar 22, 2015: Enable doc values by default, when appropriate #10209 (Merged)

rjernst removed the discuss label Mar 23, 2015

rjernst closed this as completed in #10209 Mar 27, 2015

javanna added the Team:Search Foundations label Jul 16, 2024
