Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make doc values accessible from update scripts #29290

Closed
bra-fsn opened this issue Mar 29, 2018 · 13 comments
Closed

Make doc values accessible from update scripts #29290

bra-fsn opened this issue Mar 29, 2018 · 13 comments
Labels
:Data Management/Indices APIs APIs to create and manage indices and templates discuss :Distributed Indexing/CRUD A catch all label for issues around indexing, updating and getting a doc by id. Not search.

Comments

@bra-fsn
Copy link

bra-fsn commented Mar 29, 2018

Feature request:
Elasticsearch supports a very compact representation/storage of data: the doc values.
Sometimes it's useful to omit fields from _source and store them only in doc values, this can be a real space saver (#27374 (comment))
Of course when you do this, you can't do partial updates on documents, because the previous value of the field is only stored in doc values.
This could be solved by making doc values accessible from update scripts, so you could do a scripted update like this:
ctx._source.counter=doc['counter']+1;

It would be nice if doc values could be accessed this way in update scripts along with the previous version of _source.

@martijnvg
Copy link
Member

martijnvg commented Mar 29, 2018

Hey @bra-fsn, thanks for opening this feature request. I agree that would be a big saver, but the update api can only update the document based on the unmodified _source. A normalizer may have modified the original content, so doc values can't be used as replacement for the _source. Normalizers are not applied on numeric fields, but then you end up in cases where some doc value fields can be updated and some not, which makes it tricky. So this isn't something that we can support consistently in the near future with how handle the original content of a document today.

I'll leave this open for others to chime in, but even without the above concern this is a high hanging fruit.

@martijnvg martijnvg added discuss :Data Management/Indices APIs APIs to create and manage indices and templates labels Mar 29, 2018
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-core-infra

@ywelsch ywelsch added the :Distributed Indexing/CRUD A catch all label for issues around indexing, updating and getting a doc by id. Not search. label Mar 29, 2018
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed

@bra-fsn
Copy link
Author

bra-fsn commented Mar 29, 2018

@martijnvg: Could you please give further details about what you've described above?
What this feature request is about is just the ability to access (current) doc value fields for the doc which is updated from painless.
I may misunderstood what you wrote, but I don't want a _source replacement mechanism here, under which I mean that elastic magically and transparently fetching previous values from doc values if the _source doesn't have that info and filling them into the new version on update.
I think you can leave this to the user.

Or you are talking about consistency issues (I don't know elastic internals), maybe that the doc values are updated differently than the doc itself, and a race condition could occur if more updates happen in parallel? (so for example in a painless script it's now guaranteed that the _source doesn't change while the script runs, but you can't guarantee this on doc values, so it may be that two subsequent calls to doc['field'] in the same script instance may give different results?)

@martijnvg
Copy link
Member

I may be misunderstood what you wrote, but I don't want a _source replacement mechanism here, under which I mean that elastic magically and transparently fetching previous values from doc values if the _source doesn't have that info and filling them into the new version on update.
I think you can leave this to the user.

Sure and the way you propose this it kind of already is up to the user by using the square bracket notation (doc['counter']) to fetch the values. But that is not what I meant with inconsistent.

Or you are talking about consistency issues (I don't know elastic internals), maybe that the doc values are updated differently than the doc itself, and a race condition could occur if more updates happen in parallel? (so for example in a painless script it's now guaranteed that the _source doesn't change while the script runs, but you can't guarantee this on doc values, so it may be that two subsequent calls to doc['field'] in the same script instance may give different results?)

No, that kind of problems will luckily not happen. If a normalizer is configured for a field and it changes the content of a doc values field at index time, then it is no longer the field value that was originally specified. Now if these doc values field are used to update a document, then this can lead to unexpected end result in the updated document.

But besides this, with #29264 in, reading from doc values fields is no longer possible, because the the _source may be read from the translog instead of the lucene index and doc values are only available in the Lucene index. Forcing to read from the Lucene index in order to support this, is a bad idea, because that forces a refresh.

@bra-fsn
Copy link
Author

bra-fsn commented Mar 29, 2018

Sure and the way you propose this it kind of already is up to the user by using the square bracket notation (doc['counter']) to fetch the values. But that is not what I meant with inconsistent.

You mean this should work (at least in current GA versions)?
I've got an
RequestError(400, u'illegal_argument_exception', {u'status': 400, u'error': {u'caused_by': {u'lang': u'painless', u'script': u'counter', u'script_stack': [u"ctx._source.counter=doc['counter']+1;\n", u' ^---- HERE'], u'caused_by': {u'reason': None, u'type': u'null_pointer_exception'}, u'reason': u'runtime error', u'type': u'script_exception'}, u'root_cause': [{u'reason': u'[wdxwsRv][127.0.0.1:9300][indices:data/write/update[s]]', u'type': u'remote_transport_exception'}], u'type': u'illegal_argument_exception', u'reason': u'failed to execute script'}})
when I've tried it.

No, that kind of problems will luckily not happen. If a normalizer is configured for a field and it changes the content of a doc values field at index time, then it is no longer the field value that was originally specified. Now if these doc values field are used to update a document, then this can lead to unexpected end result in the updated document.

I think it's OK. If somebody uses this, this should be evident. The same is true with every other doc values operation, which elastic current allows (like in the search path).

But besides this, with #29264 in, reading from doc values fields is no longer possible, because the the _source may be read from the translog instead of the lucene index and doc values are only available in the Lucene index. Forcing to read from the Lucene index in order to support this, is a bad idea, because that forces a refresh.

First of all, this is great, we're suffering from slow scripted updates. Glad to hear that the previous optimization will work again (for updates at least).
Second, it's sad to hear that. I think elasticsearch has great possibilities in offering more support for doc values. It gives tremendous space savings compared to storing data in _source field.

@jpountz
Copy link
Contributor

jpountz commented Mar 29, 2018

For the record, having access to doc values would be a problem with #29264. cc @s1monw

@jpountz
Copy link
Contributor

jpountz commented Mar 29, 2018

Woops sorry I read the history of this conversation too quickly.

@martijnvg
Copy link
Member

You mean this should work (at least in current GA versions)?

Oh no :), but that is the same syntax we use in the other were scripts are allowed to access doc values.

@bra-fsn
Copy link
Author

bra-fsn commented Mar 29, 2018

@jpountz : Thanks for joining. :)
I think it would be a great addition to support this.
We store large numbers which take a lot of space thanks to the "json as a text" _source storage. Until this changes (by switching to BSON or something similar solution, which -I guess- still won't be as effective as the current doc values) tremendous space can be saved with DVs.

Also, it could help reindexing as well. For example we have some fields which we can't store in _source, because the DB would be so big, but we want to index the field.
Currently we store the value in a more effective database (plain old files :) ) and exclude the field from _source. This way if we have to reindex the index, we have to re-read the data from files.
If we could store the data in DVs (and use it in ways like it's described in this issue and in #27374), we could forget about all of this mess, and just store the data in elastic.
It's suprisingly good in it, sometimes beating the filesystem itself!
That way when we have to do a reindex, we could just use the data in DVs and we could even use elastic's reindex API.

@jpountz
Copy link
Contributor

jpountz commented Mar 29, 2018

The fact that the update API doesn't allow to use doc values is so important to keep complexity reasonable that I'm reluctant to add such a feature. I know it's less convenient but it's always possible to implement the update logic on client-side by using versioning to make sure the right version of the document is being replaced.

@rjernst
Copy link
Member

rjernst commented Mar 29, 2018

We discussed in FixItDay and agreed we do not want to introduce the complexity this would entail. I'm going to close this issue since there is technical consensus that we should not make such a change.

@rjernst rjernst closed this as completed Mar 29, 2018
@bleskes
Copy link
Contributor

bleskes commented Apr 10, 2018

There's another aspect here that I didn't see mentioned but I think is important for future readers - allowing access to lucene features (like doc values) collides with reading from the translog but also prevents us from shifting to running the update scripts on the coordinating nodes (#8369), either as a dedicated API or as part of bulk.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Data Management/Indices APIs APIs to create and manage indices and templates discuss :Distributed Indexing/CRUD A catch all label for issues around indexing, updating and getting a doc by id. Not search.
Projects
None yet
Development

No branches or pull requests

7 participants