Make doc values accessible from update scripts #29290

bra-fsn · 2018-03-29T08:07:13Z

Feature request:
Elasticsearch supports a very compact representation/storage of data: the doc values.
Sometimes it's useful to omit fields from _source and store them only in doc values, this can be a real space saver (#27374 (comment))
Of course when you do this, you can't do partial updates on documents, because the previous value of the field is only stored in doc values.
This could be solved by making doc values accessible from update scripts, so you could do a scripted update like this:
ctx._source.counter=doc['counter']+1;

It would be nice if doc values could be accessed this way in update scripts along with the previous version of _source.

martijnvg · 2018-03-29T08:55:45Z

Hey @bra-fsn, thanks for opening this feature request. I agree that would be a big saver, but the update api can only update the document based on the unmodified _source. A normalizer may have modified the original content, so doc values can't be used as replacement for the _source. Normalizers are not applied on numeric fields, but then you end up in cases where some doc value fields can be updated and some not, which makes it tricky. So this isn't something that we can support consistently in the near future with how handle the original content of a document today.

I'll leave this open for others to chime in, but even without the above concern this is a high hanging fruit.

elasticmachine · 2018-03-29T08:56:01Z

Pinging @elastic/es-core-infra

elasticmachine · 2018-03-29T08:56:43Z

Pinging @elastic/es-distributed

bra-fsn · 2018-03-29T09:06:01Z

@martijnvg: Could you please give further details about what you've described above?
What this feature request is about is just the ability to access (current) doc value fields for the doc which is updated from painless.
I may misunderstood what you wrote, but I don't want a _source replacement mechanism here, under which I mean that elastic magically and transparently fetching previous values from doc values if the _source doesn't have that info and filling them into the new version on update.
I think you can leave this to the user.

Or you are talking about consistency issues (I don't know elastic internals), maybe that the doc values are updated differently than the doc itself, and a race condition could occur if more updates happen in parallel? (so for example in a painless script it's now guaranteed that the _source doesn't change while the script runs, but you can't guarantee this on doc values, so it may be that two subsequent calls to doc['field'] in the same script instance may give different results?)

martijnvg · 2018-03-29T09:52:36Z

I may be misunderstood what you wrote, but I don't want a _source replacement mechanism here, under which I mean that elastic magically and transparently fetching previous values from doc values if the _source doesn't have that info and filling them into the new version on update.
I think you can leave this to the user.

Sure and the way you propose this it kind of already is up to the user by using the square bracket notation (doc['counter']) to fetch the values. But that is not what I meant with inconsistent.

Or you are talking about consistency issues (I don't know elastic internals), maybe that the doc values are updated differently than the doc itself, and a race condition could occur if more updates happen in parallel? (so for example in a painless script it's now guaranteed that the _source doesn't change while the script runs, but you can't guarantee this on doc values, so it may be that two subsequent calls to doc['field'] in the same script instance may give different results?)

No, that kind of problems will luckily not happen. If a normalizer is configured for a field and it changes the content of a doc values field at index time, then it is no longer the field value that was originally specified. Now if these doc values field are used to update a document, then this can lead to unexpected end result in the updated document.

But besides this, with #29264 in, reading from doc values fields is no longer possible, because the the _source may be read from the translog instead of the lucene index and doc values are only available in the Lucene index. Forcing to read from the Lucene index in order to support this, is a bad idea, because that forces a refresh.

bra-fsn · 2018-03-29T10:55:26Z

Sure and the way you propose this it kind of already is up to the user by using the square bracket notation (doc['counter']) to fetch the values. But that is not what I meant with inconsistent.

You mean this should work (at least in current GA versions)?
I've got an
RequestError(400, u'illegal_argument_exception', {u'status': 400, u'error': {u'caused_by': {u'lang': u'painless', u'script': u'counter', u'script_stack': [u"ctx._source.counter=doc['counter']+1;\n", u' ^---- HERE'], u'caused_by': {u'reason': None, u'type': u'null_pointer_exception'}, u'reason': u'runtime error', u'type': u'script_exception'}, u'root_cause': [{u'reason': u'[wdxwsRv][127.0.0.1:9300][indices:data/write/update[s]]', u'type': u'remote_transport_exception'}], u'type': u'illegal_argument_exception', u'reason': u'failed to execute script'}})
when I've tried it.

No, that kind of problems will luckily not happen. If a normalizer is configured for a field and it changes the content of a doc values field at index time, then it is no longer the field value that was originally specified. Now if these doc values field are used to update a document, then this can lead to unexpected end result in the updated document.

I think it's OK. If somebody uses this, this should be evident. The same is true with every other doc values operation, which elastic current allows (like in the search path).

But besides this, with #29264 in, reading from doc values fields is no longer possible, because the the _source may be read from the translog instead of the lucene index and doc values are only available in the Lucene index. Forcing to read from the Lucene index in order to support this, is a bad idea, because that forces a refresh.

First of all, this is great, we're suffering from slow scripted updates. Glad to hear that the previous optimization will work again (for updates at least).
Second, it's sad to hear that. I think elasticsearch has great possibilities in offering more support for doc values. It gives tremendous space savings compared to storing data in _source field.

jpountz · 2018-03-29T11:25:43Z

For the record, having access to doc values would be a problem with #29264. cc @s1monw

jpountz · 2018-03-29T11:27:40Z

Woops sorry I read the history of this conversation too quickly.

martijnvg · 2018-03-29T11:28:18Z

You mean this should work (at least in current GA versions)?

Oh no :), but that is the same syntax we use in the other were scripts are allowed to access doc values.

bra-fsn · 2018-03-29T11:51:07Z

@jpountz : Thanks for joining. :)
I think it would be a great addition to support this.
We store large numbers which take a lot of space thanks to the "json as a text" _source storage. Until this changes (by switching to BSON or something similar solution, which -I guess- still won't be as effective as the current doc values) tremendous space can be saved with DVs.

Also, it could help reindexing as well. For example we have some fields which we can't store in _source, because the DB would be so big, but we want to index the field.
Currently we store the value in a more effective database (plain old files :) ) and exclude the field from _source. This way if we have to reindex the index, we have to re-read the data from files.
If we could store the data in DVs (and use it in ways like it's described in this issue and in #27374), we could forget about all of this mess, and just store the data in elastic.
It's suprisingly good in it, sometimes beating the filesystem itself!
That way when we have to do a reindex, we could just use the data in DVs and we could even use elastic's reindex API.

jpountz · 2018-03-29T15:26:35Z

The fact that the update API doesn't allow to use doc values is so important to keep complexity reasonable that I'm reluctant to add such a feature. I know it's less convenient but it's always possible to implement the update logic on client-side by using versioning to make sure the right version of the document is being replaced.

rjernst · 2018-03-29T20:08:51Z

We discussed in FixItDay and agreed we do not want to introduce the complexity this would entail. I'm going to close this issue since there is technical consensus that we should not make such a change.

bleskes · 2018-04-10T09:57:44Z

There's another aspect here that I didn't see mentioned but I think is important for future readers - allowing access to lucene features (like doc values) collides with reading from the translog but also prevents us from shifting to running the update scripts on the coordinating nodes (#8369), either as a dedicated API or as part of bulk.

martijnvg added discuss :Data Management/Indices APIs APIs to create and manage indices and templates labels Mar 29, 2018

ywelsch added the :Distributed Indexing/CRUD A catch all label for issues around indexing, updating and getting a doc by id. Not search. label Mar 29, 2018

rjernst closed this as completed Mar 29, 2018

rjernst mentioned this issue Apr 27, 2018

Doc values (doc['field'] notation) does not work on update #30210

Closed

OlivierPetri mentioned this issue May 2, 2018

Strict Type Checking on Import (or strict type usage for scripts) #30211

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make doc values accessible from update scripts #29290

Make doc values accessible from update scripts #29290

bra-fsn commented Mar 29, 2018 •

edited

Loading

martijnvg commented Mar 29, 2018 •

edited

Loading

elasticmachine commented Mar 29, 2018

elasticmachine commented Mar 29, 2018

bra-fsn commented Mar 29, 2018 •

edited

Loading

martijnvg commented Mar 29, 2018

bra-fsn commented Mar 29, 2018

jpountz commented Mar 29, 2018

jpountz commented Mar 29, 2018

martijnvg commented Mar 29, 2018

bra-fsn commented Mar 29, 2018 •

edited

Loading

jpountz commented Mar 29, 2018

rjernst commented Mar 29, 2018

bleskes commented Apr 10, 2018

Make doc values accessible from update scripts #29290

Make doc values accessible from update scripts #29290

Comments

bra-fsn commented Mar 29, 2018 • edited Loading

martijnvg commented Mar 29, 2018 • edited Loading

elasticmachine commented Mar 29, 2018

elasticmachine commented Mar 29, 2018

bra-fsn commented Mar 29, 2018 • edited Loading

martijnvg commented Mar 29, 2018

bra-fsn commented Mar 29, 2018

jpountz commented Mar 29, 2018

jpountz commented Mar 29, 2018

martijnvg commented Mar 29, 2018

bra-fsn commented Mar 29, 2018 • edited Loading

jpountz commented Mar 29, 2018

rjernst commented Mar 29, 2018

bleskes commented Apr 10, 2018

bra-fsn commented Mar 29, 2018 •

edited

Loading

martijnvg commented Mar 29, 2018 •

edited

Loading

bra-fsn commented Mar 29, 2018 •

edited

Loading

bra-fsn commented Mar 29, 2018 •

edited

Loading