first cut at ephemeral fields #9189
I signed the company CLA, but did not add myself to it (last page) before submitting the pull request. After submitting, I signed the "contribute under company CLA" for myself (kiwigaffa). Is there any way to redo the checks to remove the "failed checks" messages? |
Related to #6619 |
I am marking this issue as stalled as we are currently trying to reduce the complexity of mappings. We should revisit it once we're done (hopefully 2.0). |
as a side note - we're working on quickening cluster change through deltas - see #9220 |
@kiwigaffa there is a lot going on in this area as @jpountz mentioned. You can check out #9364, #9365, and #8870. #8870 leads to a bunch of other issues, including #8871. You may also have seen we have a mapping label to group these: Also, the partial cluster state update that @bleskes mentioned should help sites with a large number of mappings. Under normal circumstances, all the cluster state updates will be incremental, so adding a new mapping to the cluster state will be cheaper. There are still challenges with a large number of fields for Lucene.... |
Thanks Kevin, Looking at all of these issues now to get a handle on what is coming. Is a cheers Jon On Tue, Jan 20, 2015 at 6:33 AM, Kevin Kluge notifications@github.com
|
@kiwigaffa in the issues is best. thanks. |
@kiwigaffa I really appreciate your contribution here. Yet, given the efforts on reducing the clusterstate size on update and all the schema hardening we are doing, I think a feature like this goes in the opposite direction to the one we are heading in right now. In elasticsearch we need to, more and more, take care of the majority of the user base, which certainly has a smaller clusterstate and fewer fields than you have. Given the complexity of the change and the rather limited use cases or users in need, I think we will not take this direction. @kiwigaffa and I spoke in person about this and basically agreed on closing it until there is a less intrusive way to achieve the same. That said, the changes we have in the pipeline for clusterstate updates etc. will be beneficial overall, but in the long term the goal is to reduce the number of fields in general. |
We have a large number of unique fields in our indices, which means
our cluster state is very large (>500MB). This size causes some
cluster level operations within ES to run more slowly than we'd like.
This change adds support for ephemeral fields, which
if required)
The basic idea is to add a new method to the Mapper interface
(isEphemeral()) then use this to control both the visibility of the
node in the cluster state tree, and the impact of adding such a node
to that tree.
In the first case (visibility), ephemeral nodes do not support
toXContent(). This effectively removes them from the serialized
cluster state string. In the case of nested fields, if all nodes under
a particular root are ephemeral, then that entire sub-tree becomes
invisible. However, the structure is visible if any child node is not
ephemeral. For example, if we have a field named a.b.c.d and it is
ephemeral and there are no other non-ephemeral fields under a, then no
part of that tree/path is visible. If we now add a field named a.b.f
and it is not ephemeral, then that field will be visible, including
the path to it, i.e. we will serialize a, b, and f. Note that in this
case we still don't serialize c or d. By making ephemeral fields
"invisible" we reduce the size of the cluster state, and therefore
reduce or remove some of the issues we've seen in our cluster.
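As a rough illustration of the visibility rule, here is a minimal Java sketch. FieldNode, addPath, isVisible and serialize are hypothetical names invented for this example, not part of the Elasticsearch code base, and the brace notation is only a stand-in for the real toXContent() output.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical stand-in for a node in the mapping tree (not the real Mapper API).
class FieldNode {
    final String name;
    final boolean ephemeral; // only meaningful for leaf fields
    final Map<String, FieldNode> children = new LinkedHashMap<>();

    FieldNode(String name, boolean ephemeral) {
        this.name = name;
        this.ephemeral = ephemeral;
    }

    // Adds a dotted path such as "a.b.c.d"; only the last segment carries the flag.
    void addPath(String path, boolean leafEphemeral) {
        FieldNode node = this;
        String[] parts = path.split("\\.");
        for (int i = 0; i < parts.length; i++) {
            final String part = parts[i];
            final boolean flag = (i == parts.length - 1) && leafEphemeral;
            node = node.children.computeIfAbsent(part, k -> new FieldNode(part, flag));
        }
    }

    // A leaf is visible unless it is ephemeral; an inner node is visible
    // if at least one node beneath it is visible.
    boolean isVisible() {
        if (children.isEmpty()) {
            return !ephemeral;
        }
        for (FieldNode child : children.values()) {
            if (child.isVisible()) {
                return true;
            }
        }
        return false;
    }

    // Serializes only the visible part of the tree, e.g. "root{a{b{f}}}".
    String serialize() {
        if (children.isEmpty()) {
            return name;
        }
        StringJoiner visible = new StringJoiner(",", name + "{", "}");
        for (FieldNode child : children.values()) {
            if (child.isVisible()) {
                visible.add(child.serialize());
            }
        }
        return visible.toString();
    }

    public static void main(String[] args) {
        FieldNode root = new FieldNode("root", false);
        root.addPath("a.b.c.d", true);         // ephemeral leaf
        System.out.println(root.serialize());  // root{}
        root.addPath("a.b.f", false);          // regular leaf
        System.out.println(root.serialize());  // root{a{b{f}}}
    }
}
```

With only a.b.c.d present, nothing under the root is serialized. Once a.b.f is added, a, b and f appear, while c and d stay hidden.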
In the second case (impact), adding an ephemeral node does not mark
the context as modified. This is an optimization that reduces the
number of updates to the cluster state, since even if the context was
marked as modified when the field was added, the serialized form of
the tree would be identical to its previous (pre-addition) form due to
the field being invisible. In a scenario where we are adding a large
number of fields quickly, this optimization is important because it
eliminates "NO-OP" updates to the cluster state.
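The effect on update traffic can be sketched as follows. MappingContext, addField and consumeModified are hypothetical names for this example only, and the suffix check in main merely simulates a template that marks *_INT fields as ephemeral.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the parse context that tracks mapping changes.
class MappingContext {
    // field name -> whether the field is ephemeral
    private final Map<String, Boolean> fields = new LinkedHashMap<>();
    private boolean modified;

    // Registers a field. Only a new, non-ephemeral field changes the serialized
    // mapping, so only that case marks the context as modified.
    void addField(String name, boolean ephemeral) {
        boolean isNew = fields.putIfAbsent(name, ephemeral) == null;
        if (isNew && !ephemeral) {
            modified = true;
        }
    }

    // Reports whether a cluster state update is needed, then resets the flag.
    boolean consumeModified() {
        boolean result = modified;
        modified = false;
        return result;
    }

    public static void main(String[] args) {
        MappingContext ctx = new MappingContext();
        String[] incoming = {"clicks_INT", "views_INT", "title", "title"};
        int updates = 0;
        for (String field : incoming) {
            ctx.addField(field, field.endsWith("_INT"));
            if (ctx.consumeModified()) {
                updates++;
            }
        }
        System.out.println(updates); // 1: only the first "title" triggers an update
    }
}
```

In this sketch the two ephemeral fields are still recorded locally, but neither of them causes an update to be sent.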
In order to support this functionality, we've added Ephemeral
subclasses of each of the standard field types, which seemed like the
lowest impact change to the code base. These new classes extend the
existing classes, but because of the way the Builder and TypeParser
classes are designed, we had to cut'n'paste these internal classes
from the base classes and make a few minor changes to reflect their
new parent classes. This does add maintenance overhead, which could be
avoided by making the core classes natively support ephemerality, but
since we were not sure how viable this approach was, we chose minimal
impact on the code base over maintainability for this pull request.
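The shape of this trade-off can be sketched with heavily simplified types. The class names below echo the description, but the bodies are hypothetical and far smaller than the real Elasticsearch mappers, Builders and TypeParsers.

```java
// Hypothetical entry point that exercises the sketch below.
class EphemeralSubclassSketch {
    public static void main(String[] args) {
        Mapper plain = new IntegerFieldMapper.Builder("count").build();
        Mapper ephemeral = new EphemeralIntegerFieldMapper.Builder("count_INT").build();
        System.out.println(plain.isEphemeral());     // false
        System.out.println(ephemeral.isEphemeral()); // true
    }
}

// Simplified mapper contract with the new method described in the text.
interface Mapper {
    String name();

    default boolean isEphemeral() {
        return false;
    }
}

// Stand-in for an existing core field type.
class IntegerFieldMapper implements Mapper {
    private final String name;

    IntegerFieldMapper(String name) {
        this.name = name;
    }

    @Override
    public String name() {
        return name;
    }

    String contentType() {
        return "integer";
    }

    static class Builder {
        final String name;

        Builder(String name) {
            this.name = name;
        }

        IntegerFieldMapper build() {
            return new IntegerFieldMapper(name);
        }
    }
}

// Ephemeral peer type: inherits all behavior and overrides only what differs.
class EphemeralIntegerFieldMapper extends IntegerFieldMapper {
    EphemeralIntegerFieldMapper(String name) {
        super(name);
    }

    @Override
    public boolean isEphemeral() {
        return true;
    }

    @Override
    String contentType() {
        return "ephemeral_integer";
    }

    // Near-copy of the parent's Builder; it exists only to construct the subclass.
    // This duplication is the maintenance overhead mentioned above.
    static class Builder {
        final String name;

        Builder(String name) {
            this.name = name;
        }

        EphemeralIntegerFieldMapper build() {
            return new EphemeralIntegerFieldMapper(name);
        }
    }
}
```

If the core classes supported ephemerality natively, for example through a flag on the existing Builder, the duplicated Builder would not be needed. That alternative is an assumption about one possible design, not something this pull request implements.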
Because these new field types are treated as peers of the current core
field types, we can use the existing field definition mechanisms to
define them in our config file. Most usefully to us, this means that
we can define templates using prefix or suffix matching on the field
names to use these new fields. For example, we can use the following
to define any field with an _INT suffix as an ephemeral integer:
"mappings" : {
  "data" : {
    "dynamic_templates" : [
      { "template_INT" : { "match" : "*_INT", "mapping" : { "type" : "ephemeral_integer" } } }
      ...
Also note that because they are sub-classes of the core classes, we
can use all of the standard modifiers for analyzer chain etc when
defining these fields.
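To make the suffix matching concrete, the following sketch resolves a field name to a type using simple wildcard patterns. It is a simplified illustration and not Elasticsearch's actual template resolution; TemplateMatchSketch, globMatches, resolveType and the fallback value are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical illustration of wildcard-based template matching.
class TemplateMatchSketch {

    // Returns true if fieldName matches a pattern in which '*' stands for any
    // run of characters. All other characters are matched literally.
    static boolean globMatches(String glob, String fieldName) {
        String[] literals = glob.split("\\*", -1);
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < literals.length; i++) {
            if (i > 0) {
                regex.append(".*");
            }
            regex.append(Pattern.quote(literals[i]));
        }
        return Pattern.matches(regex.toString(), fieldName);
    }

    // Returns the type of the first matching template (insertion order),
    // or the fallback if no template matches.
    static String resolveType(Map<String, String> templates, String fieldName, String fallback) {
        for (Map.Entry<String, String> template : templates.entrySet()) {
            if (globMatches(template.getKey(), fieldName)) {
                return template.getValue();
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        Map<String, String> templates = new LinkedHashMap<>();
        templates.put("*_INT", "ephemeral_integer");
        System.out.println(resolveType(templates, "clicks_INT", "default")); // ephemeral_integer
        System.out.println(resolveType(templates, "title", "default"));      // default
    }
}
```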
The two main issues that we're aware of are:
1. Because we create a new field on demand when servicing a search
request, and do not try to inject it into the local context for the
index, there would be (a) additional GC activity to clean up these
short-lived objects, and (b) additional overhead on each search
request that uses ephemeral fields.
2. Because ephemeral fields are added to the context while indexing
but never removed, there is concern that over time the indexing nodes
will be using a significant amount of heap for these fields. More
troubling is that this heap usage will not correspond to the
reported cluster state size. We're looking at a solution whereby we
could scrub fields from the cluster state using a REST request
(specifying a root node in the tree), but this needs more thought.