first cut at ephemeral fields #9189

kiwigaffa · 2015-01-08T01:44:19Z

We have a large number of unique fields in our indices, which means
our cluster state is very large (>500MB). This size causes some
cluster level operations within ES to run more slowly than we'd like.

This change adds support for ephemeral fields, which:

- behave exactly like normal fields during indexing,
- behave differently during search (a new Mapper is created on demand
  if required), and
- are invisible when rendering cluster state (solving our problem).

The basic idea is to add a new method to the Mapper interface
(isEphemeral()) then use this to control both the visibility of the
node in the cluster state tree, and the impact of adding such a node
to that tree.

In the first case (visibility), ephemeral nodes do not support
toXContent(). This effectively removes them from the serialized
cluster state string. In the case of nested fields, if all nodes under
a particular root are ephemeral, then that entire sub-tree becomes
invisible. However, the structure is visible if any child node is not
ephemeral. For example, if we have a field named a.b.c.d that is
ephemeral, and there are no non-ephemeral fields under a, then no
part of that tree/path is visible. If we now add a field named a.b.f
that is not ephemeral, then that field will be visible, including
the path to it, i.e. we will serialize a, b, and f. Note that in this
case we still don't serialize c or d. By making ephemeral fields
"invisible" we reduce the size of the cluster state, and therefore
reduce or remove some of the issues we've seen in our cluster.
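
The visibility rule can be sketched in a few lines. This is only an illustration under stated assumptions, not the code in this pull request: `FieldNode`, `addField`, `isVisible` and `render` are hypothetical names, and the compact `a{b{f}}` output stands in for the real toXContent() serialization.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical stand-in for a mapping tree node; not the Elasticsearch Mapper API. */
class FieldNode {
    final String name;
    final boolean ephemeral; // only consulted for leaf nodes
    final Map<String, FieldNode> children = new LinkedHashMap<>();

    FieldNode(String name, boolean ephemeral) {
        this.name = name;
        this.ephemeral = ephemeral;
    }

    /** Adds a leaf at a dotted path such as "a.b.c.d", creating object nodes on the way. */
    void addField(String path, boolean ephemeralLeaf) {
        String[] parts = path.split("\\.");
        FieldNode node = this;
        for (int i = 0; i < parts.length; i++) {
            boolean leaf = i == parts.length - 1;
            node = node.children.computeIfAbsent(
                    parts[i], n -> new FieldNode(n, leaf && ephemeralLeaf));
        }
    }

    /** A leaf is visible unless it is ephemeral; an object node is visible if any child is. */
    boolean isVisible() {
        if (children.isEmpty()) {
            return !ephemeral;
        }
        for (FieldNode child : children.values()) {
            if (child.isVisible()) {
                return true;
            }
        }
        return false;
    }

    /** Renders only the visible part of the tree; invisible sub-trees are skipped entirely. */
    String render() {
        StringJoiner out = new StringJoiner(",", name + "{", "}");
        out.setEmptyValue(name);
        for (FieldNode child : children.values()) {
            if (child.isVisible()) {
                out.add(child.render());
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        FieldNode root = new FieldNode("mappings", false);
        root.addField("a.b.c.d", true);
        System.out.println(root.render()); // nothing under "a" is rendered yet
        root.addField("a.b.f", false);
        System.out.println(root.render()); // a, b and f are rendered; c and d are not
    }
}
```

The point of the sketch is that visibility is decided bottom-up: an intermediate node never needs its own flag, because it is rendered only when at least one descendant leaf is non-ephemeral.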

In the second case (impact), adding an ephemeral node does not mark
the context as modified. This is an optimization that reduces the
number of updates to the cluster state, since even if the context was
marked as modified when the field was added, the serialized form of
the tree would be identical to its previous (pre-addition) form due to
the field being invisible. In a scenario where we are adding a large
number of fields quickly, this optimization is important because it
eliminates "NO-OP" updates to the cluster state.
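
As a rough sketch of this second case, assuming a much simplified context object: `MappingContext`, `finishDocument` and the update counter below are hypothetical and only model the "do not mark as modified" decision, not the real indexing path.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the indexing context; all names are illustrative. */
class MappingContext {
    private final Map<String, Boolean> fields = new HashMap<>(); // field name -> ephemeral?
    private boolean modified;
    private int clusterStateUpdates;

    /** Registers a new field; only a new non-ephemeral field marks the context as modified. */
    void addField(String name, boolean ephemeral) {
        boolean isNew = fields.putIfAbsent(name, ephemeral) == null;
        if (isNew && !ephemeral) {
            modified = true;
        }
    }

    /** Counts a cluster state update only when something visible changed. */
    void finishDocument() {
        if (modified) {
            clusterStateUpdates++;
            modified = false;
        }
    }

    boolean knows(String name) {
        return fields.containsKey(name);
    }

    int clusterStateUpdates() {
        return clusterStateUpdates;
    }

    public static void main(String[] args) {
        MappingContext ctx = new MappingContext();
        ctx.addField("clicks_INT", true);  // ephemeral: known locally, no update triggered
        ctx.finishDocument();
        ctx.addField("title", false);      // regular field: triggers one update
        ctx.finishDocument();
        System.out.println(ctx.clusterStateUpdates());
    }
}
```

The ephemeral field is still registered in the local context, so indexing behaves as usual; it simply never contributes to the decision to publish a new cluster state.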

In order to support this functionality, we've added Ephemeral
subclasses of each of the standard field types, which seemed like the
lowest impact change to the code base. These new classes extend the
existing classes, but because of the way the Builder and TypeParser
classes are designed, we had to cut'n'paste these internal classes
from the base classes and make a few minor changes to reflect their
new parent classes. This does add maintenance overhead which could be
avoided by making the core classes natively support ephemerality, but
since we were not sure how viable this approach was, we chose minimal
impact on the code base over maintenance for this pull request.
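
The subclassing approach looks roughly like the sketch below. The class names are hypothetical simplifications (the real core mappers also carry the Builder and TypeParser inner classes mentioned above); only the `ephemeral_integer` type name comes from this pull request.

```java
/** Simplified stand-in for a core field mapper; not the real Elasticsearch class. */
class IntegerFieldSketch {
    final String name;

    IntegerFieldSketch(String name) {
        this.name = name;
    }

    String contentType() {
        return "integer";
    }

    boolean isEphemeral() {
        return false;
    }

    /** Indexing behaviour, inherited unchanged by the ephemeral variant. */
    int parse(String raw) {
        return Integer.parseInt(raw.trim());
    }

    public static void main(String[] args) {
        IntegerFieldSketch field = new EphemeralIntegerFieldSketch("clicks_INT");
        System.out.println(field.contentType() + " ephemeral=" + field.isEphemeral());
    }
}

/** Ephemeral peer of the core type: same parsing, different type name and visibility. */
class EphemeralIntegerFieldSketch extends IntegerFieldSketch {
    EphemeralIntegerFieldSketch(String name) {
        super(name);
    }

    @Override
    String contentType() {
        return "ephemeral_integer";
    }

    @Override
    boolean isEphemeral() {
        return true;
    }
}
```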

Because these new field types are treated as peers of the current core
field types, we can use the existing field definition mechanisms to
define them in our config file. Most usefully to us, this means that
we can define templates using prefix or suffix matching on the field
names to use these new fields. For example, we can use the following
to map any field with an _INT suffix to an ephemeral integer:

    "mappings" : {
      "data" : {
        "dynamic_templates" : [
          { "template_INT" : { "match" : "*_INT", "mapping" : { "type" : "ephemeral_integer" } } }
          ...
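
To illustrate how a suffix template like the one above picks a type for a new field, here is a minimal sketch. `TemplateResolver` and its methods are hypothetical, the matcher only handles a single leading or trailing `*`, and the `"string"` fallback type is an assumption for the example; Elasticsearch's own dynamic template matching is more general.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical resolver: maps field names to mapping types via prefix/suffix patterns. */
class TemplateResolver {
    // match pattern -> mapping type, checked in insertion order
    private final Map<String, String> templates = new LinkedHashMap<>();

    void addTemplate(String match, String type) {
        templates.put(match, type);
    }

    /** Supports one leading or trailing '*', i.e. suffix and prefix matching only. */
    static boolean matches(String pattern, String field) {
        if (pattern.startsWith("*")) {
            return field.endsWith(pattern.substring(1));
        }
        if (pattern.endsWith("*")) {
            return field.startsWith(pattern.substring(0, pattern.length() - 1));
        }
        return pattern.equals(field);
    }

    /** Returns the type of the first matching template, or the fallback if none matches. */
    String resolve(String field, String fallbackType) {
        for (Map.Entry<String, String> template : templates.entrySet()) {
            if (matches(template.getKey(), field)) {
                return template.getValue();
            }
        }
        return fallbackType;
    }

    public static void main(String[] args) {
        TemplateResolver resolver = new TemplateResolver();
        resolver.addTemplate("*_INT", "ephemeral_integer");
        System.out.println(resolver.resolve("clicks_INT", "string"));
        System.out.println(resolver.resolve("title", "string"));
    }
}
```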

Also note that because they are sub-classes of the core classes, we
can use all of the standard modifiers for analyzer chain etc when
defining these fields.

The two main issues that we're aware of are:

1. Because we create a new field on demand when servicing a search
   request, and do not try to inject it into the local context for the
   index, there will be (a) additional GC activity to clean up these
   short-lived objects, and (b) additional overhead on each search
   request that uses ephemeral fields.
2. Because ephemeral fields are added to the context while indexing
   but never removed, there is concern that over time the indexing nodes
   will use a significant amount of heap for these fields. More
   troubling is that this heap usage will not correspond to the
   reported cluster state size. We're looking at a solution whereby we
   could scrub fields from the cluster state using a REST request
   (specifying a root node in the tree), but this needs more thought.

kiwigaffa · 2015-01-08T19:41:23Z

I signed the company CLA, but did not add myself to it (last page) before submitting the pull request. After submitting, I signed the "contribute under company CLA" for myself (kiwigaffa). Is there any way to redo the checks to remove the "failed checks" messages?

jpountz · 2015-01-12T23:15:27Z

Related to #6619

jpountz · 2015-01-16T10:40:42Z

I am marking this issue as stalled as we are currently trying to reduce the complexity of mappings. We should revisit it once we're done (hopefully 2.0).

bleskes · 2015-01-16T15:09:38Z

> This size causes some cluster level operations within ES to run more slowly than we'd like.

As a side note, we're working on speeding up cluster state changes through deltas; see #9220.

kevinkluge · 2015-01-20T14:32:51Z

@kiwigaffa there is a lot going on in this area as @jpountz mentioned. You can check out #9364, #9365, and #8870. #8870 leads to a bunch of other issues, including #8871.

You may also have seen we have a mapping label to group these:
https://github.com/elasticsearch/elasticsearch/labels/%3AMapping

Also, the partial cluster state update that @bleskes mentioned should help sites with a large number of mappings. Under normal circumstances, all the cluster state updates will be incremental, so adding a new mapping to the cluster state will be cheaper. There are still challenges with a large number of fields for Lucene....

kiwigaffa · 2015-01-20T18:32:56Z

Thanks Kevin,

Looking at all of these issues now to get a handle on what is coming. Is a
comment on the issue the best way to provide feedback, or is there a
back-channel you'd prefer?

cheers

Jon


kevinkluge · 2015-01-22T10:11:12Z

@kiwigaffa in the issues is best. thanks.

s1monw · 2015-03-20T21:10:01Z

@kiwigaffa I really appreciate your contribution here. Yet, given the efforts on reducing the cluster state size on update and all the schema hardening we are doing, I think a feature like this goes in the opposite direction from where we are heading right now. In Elasticsearch we need to, more and more, take care of the majority of the user base, which certainly has a smaller cluster state and fewer fields than you have. Given the complexity of the change and the rather limited use cases or users in need, I think we will not take this direction. @kiwigaffa and I spoke in person about this and basically agreed on closing it until there is a less intrusive way to achieve the same.

That said, the changes we have in the pipeline for cluster state updates etc. will be beneficial overall, but in the long term the goal is to reduce the number of fields in general.

first cut at ephemeral fields

002c67f

jpountz added the stalled label Jan 16, 2015

drewr force-pushed the master branch from dcc3da0 to 7c20a8a Compare February 20, 2015 16:48

s1monw closed this Mar 20, 2015
