Switch doc_values=true to be the default for non-analyzed string fields #8312
Comments
@gibrown have you seen increased segment memory usage after switching to |
The JVM heap size is practically limited to 32GB: above this, the JVM can no longer use compressed pointers (resulting in more memory usage for the same data), and garbage collections become slower.

By far the biggest user of the heap for most users is in-memory fielddata, used for sorting, aggregations and scripts. In-memory fielddata is slow to load, as it has to read the whole inverted index and uninvert it. If the fielddata cache fills up, old data is evicted, causing heap churn and bad performance (as fielddata is reloaded and evicted again).

Doc values provide the same function as in-memory fielddata, but the data structure is written to disk at index time. This results in more disk usage and somewhat slower indexing and merging (because there is more I/O). Aggregations and sorting are about 20% slower than they are with in-memory fielddata. The advantages are:
Unfortunately, there is no way to back-fill these values without reindexing. The proposal is to make doc values the default for all fields except Some users will end up writing much more data than they need, but to be clear: this is a default setting which can be changed as appropriate. It is similar to the fact that we index term positions by default on all Before making this decision, we need to understand the full impact of doc-values-by-default, by running the following tests:
And measuring the following:
We need to do this at large scale, with a billion documents, on nodes that are properly tuned (e.g. ES_HEAP, mlockall etc.). Two further proposals:
|
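To make the "uninvert" step from the issue description concrete, here is a minimal Python sketch. It is an illustration only, not Elasticsearch or Lucene code: the function name `uninvert` and the toy dict-based index are assumptions made for this example. It shows why building fielddata at search time has to walk every term of the inverted index, whereas doc values store the per-document column at index time.

```python
def uninvert(inverted_index, num_docs):
    """Turn a term -> [doc ids] mapping into a doc id -> [terms] column.

    Hypothetical helper for illustration; real fielddata uses compact
    ordinal-based structures rather than Python lists.
    """
    columns = [[] for _ in range(num_docs)]
    # Every term and every posting must be visited, which is why
    # loading in-memory fielddata is slow on a large index.
    for term in sorted(inverted_index):
        for doc_id in inverted_index[term]:
            columns[doc_id].append(term)
    return columns


# Toy inverted index over three documents (ids 0, 1, 2).
inverted = {"red": [0, 2], "blue": [1], "green": [2]}
columns = uninvert(inverted, num_docs=3)
# columns[doc_id] now answers "which values does this document have?",
# the lookup that sorting and aggregations need.
```

With doc values, a structure equivalent to `columns` is written to disk while indexing, so nothing has to be rebuilt on the heap when the first sort or aggregation runs.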
@clintongormley what if i enable doc_values=true for the metadata _id field? Will it have some effect? "_id": { |
@bharvidixit Actually we are also thinking about having doc values enabled on |
Doc values significantly reduce heap usage, which results in faster GCs. This change makes the default for doc values dynamic: any field that is indexed but not analyzed now has doc values. This only affects fields on indexes created with 2.0+. closes elastic#8312 closes elastic#10209
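The default rule described in that change can be sketched in Python as follows. This is a simplified illustration, not the actual mapper code: the function `doc_values_by_default` is hypothetical, the field mappings are plain dicts in the 1.x-style syntax (`"index": "not_analyzed"`), and the assumption that an explicit `doc_values` setting overrides the default follows from the statement above that the default "can be changed as appropriate".

```python
def doc_values_by_default(field_mapping):
    """Return True if a field would get doc values under the described rule.

    Hypothetical helper: "any field that is indexed but not analyzed
    now has doc values".
    """
    # Assumption: an explicit setting always wins over the default.
    if "doc_values" in field_mapping:
        return bool(field_mapping["doc_values"])
    # Assumption: string fields are analyzed unless stated otherwise;
    # other types are treated as indexed but not analyzed.
    if field_mapping.get("type") == "string":
        index = field_mapping.get("index", "analyzed")
    else:
        index = field_mapping.get("index", "not_analyzed")
    return index == "not_analyzed"


exact_tag = {"type": "string", "index": "not_analyzed"}
full_text = {"type": "string"}
not_indexed = {"type": "string", "index": "no"}
opted_out = {"type": "long", "doc_values": False}
```

Under this sketch, `exact_tag` gets doc values, while `full_text`, `not_indexed` and `opted_out` do not.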
Awesome, Thanks! |
👍 |
Is it possible to set eager loading, or something like that, for doc_values? |
@jknewman3 See discussions on #8693 |
Ah, that helps a lot. Thanks! |
How do I set doc_values=true for every field via the API? |
@bolee please ask questions like that in the forum: https://discuss.elastic.co/ |
We recently started doing a lot of sorting by date and integer fields on subsets of our 5 billion documents. We are mostly running queries that filter to a few thousand docs and then sort that subset. Needing to load 5 billion dates into memory for this seemingly common use case makes ES feel broken. :)
Heap memory management can be very painful and I think represents a difficult point for new users. Search engines are all about preprocessing data (indexing) so that querying can be fast. In my opinion doc_values accomplishes this better than the current default even if it can be slower in some cases.
FWIW, doc_values is MUCH faster in our use case using 1.3.4.