Account for remaining recovery in disk allocator #58029 (Merged)

Conversation

@DaveCTurner (Contributor):

Today the disk-based shard allocator accounts for incoming shards by
subtracting the estimated size of the incoming shard from the free space on the
node. This is an overly conservative estimate if the incoming shard has almost
finished its recovery since in that case it is already consuming most of the
disk space it needs.

This change adds to the shard stats a measure of how much larger each store is
expected to grow, computed from the ongoing recovery, and uses this to account
for the disk usage of incoming shards more accurately.
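
As a rough illustration of this accounting change, here is a minimal, self-contained sketch; all names and sizes are invented for the example and are not the actual Elasticsearch code. Instead of charging the full estimated shard size against the node's free space, only the portion the recovery has not yet written to disk is charged.

    // Illustrative sketch only; names and sizes are invented for this example.
    public class ReservedSpaceExample {
        public static void main(String[] args) {
            long estimatedShardSize    = 50L << 30; // the incoming shard is expected to reach 50 GiB
            long bytesAlreadyRecovered = 45L << 30; // 45 GiB already copied by the ongoing recovery
            long reservedBytes = Math.max(0L, estimatedShardSize - bytesAlreadyRecovered); // 5 GiB still to come

            long freeBytes = 60L << 30;                          // free space reported for the data path
            long effectiveFreeBytes = freeBytes - reservedBytes; // 55 GiB usable for allocation decisions,
                                                                 // not the 10 GiB the old "subtract the whole
                                                                 // shard" estimate would leave
            System.out.println((effectiveFreeBytes >> 30) + " GiB free after accounting for the recovery");
        }
    }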

@DaveCTurner added the >enhancement, :Distributed Coordination/Allocation, v8.0.0, and v7.9.0 labels on Jun 12, 2020
@elasticmachine (Collaborator):

Pinging @elastic/es-distributed (:Distributed/Allocation)

@elasticmachine added the Team:Distributed (meta label for distributed team, obsolete) label on Jun 12, 2020
@DaveCTurner (Contributor, Author):

Although I ran the :server test suite a good deal, CI is indicating that there are quite a few more test adjustments needed. Suggest holding off on detailed reviews until I've made CI happy; marking this as WIP for now.

@DaveCTurner removed the WIP label on Jun 12, 2020
@DaveCTurner (Contributor, Author):

OK, it was a lot of noise from relatively few broken tests; this is good to go now.

@ywelsch (Contributor) left a review:

I've left some minor comments; overall this is looking good though.

@henningandersen (Contributor) left a review:

I did an initial review and left some initial comments.

Review comment on docs/reference/cluster/stats.asciidoc (outdated, resolved)
@@ -140,6 +140,10 @@ public void onNewInfo(ClusterInfo info) {
final DiskUsage usage = entry.value;
final RoutingNode routingNode = routingNodes.node(node);

final long reservedSpace = info.getReservedSpace(usage.getNodeId(), usage.getPath()).getTotal();
final DiskUsage usageWithReservedSpace = new DiskUsage(usage.getNodeId(), usage.getNodeName(), usage.getPath(),
@henningandersen (Contributor):

AFAICS this compensation affects two functionalities:

  1. Triggering a reroute when going under low threshold
  2. Auto-releasing indices when disk usage goes below high threshold.

I think 1) is fine, but I am a bit concerned about 2), since the primary reason we only release when under the high threshold is to have some hysteresis and avoid toggling flood-stage on and off rapidly. I think I would prefer to do the auto-release based on the non-compensated number (just like we move to flood-stage based on that), but I am curious about your thoughts on this.

@DaveCTurner (Contributor, Author):

Yes, this is a good point. I'll work on addressing that.

@DaveCTurner (Contributor, Author):

I'm no longer sure about this. The compensated disk usage is never less than the non-compensated one, so auto-releasing based on the compensated number has more hysteresis than today's behaviour. Is the additional hysteresis the thing that concerns you @henningandersen or has one of us got our logic backwards?

@henningandersen (Contributor):

Yes, that is my concern. Something cured the situation (a deleted index, moved shards, etc.), but due to an otherwise unrelated recovery we would keep halting indexing until that recovery is done. I would prefer to resume indexing (i.e., release the block) and then perhaps later find out that we need to halt it again. If the cluster is truly full, we will get to that point anyway, but if it has enough space, we are likely to cure it before hitting the flood stage again.

I think the additional hysteresis is unnecessary. AFAIK we have not seen the block flapping too frequently, and by using the non-compensated number we would simply keep that behaviour unmodified by this PR.

@DaveCTurner (Contributor, Author):

I see, right. I don't think the additional hysteresis would be much of a problem in practice (if you hit the flood-stage watermark you have already lost), and continuing to recover onto a node that has breached the flood-stage watermark is likely the real bug here. However, it turns out not to add too much complexity, so I adjusted this in d8afaf3.
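
To make the outcome of this thread concrete, here is a small self-contained sketch of the rule that was settled on. The names and thresholds are invented for illustration; the real logic lives in the onNewInfo method shown in the diff above.

    // Self-contained sketch; names and thresholds are invented for illustration.
    public class WatermarkSketch {
        public static void main(String[] args) {
            long freeBytes = 9L << 30;               // free space actually on disk
            long reservedBytes = 4L << 30;           // still to be written by an ongoing recovery
            long lowWatermarkFreeBytes = 15L << 30;  // reroute when free space drops below this
            long highWatermarkFreeBytes = 10L << 30; // auto-release the write block above this

            long compensatedFreeBytes = Math.max(0L, freeBytes - reservedBytes);

            // Low-watermark / allocation decisions use the compensated number, so the space an
            // incoming recovery will still consume is taken into account.
            boolean shouldReroute = compensatedFreeBytes < lowWatermarkFreeBytes;

            // The flood-stage write block is released based on the uncompensated free space
            // (the adjustment made in d8afaf3), so an unrelated recovery cannot keep indexing
            // halted once the disk has actually been freed up.
            boolean releaseWriteBlock = freeBytes > highWatermarkFreeBytes;

            System.out.println("reroute=" + shouldReroute + ", releaseBlock=" + releaseWriteBlock);
        }
    }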

@henningandersen (Contributor) left a review:

This looks good; I have just two comments/questions on exposing the unknown size (-1) in the API.

(<<byte-units,byte value>>)
A prediction of how much larger the shard stores will eventually grow due to
ongoing peer recoveries, restoring snapshots, and similar activities. A value
of `-1b` indicates that this is not available.
@henningandersen (Contributor):

I am slightly torn on the -1b value here. It only takes one shard with an unknown reserved size to hide the summarized value, and mostly this is just bad timing anyway: the information we have is not a consistent snapshot across the cluster, so interpreting correlations between different parts of the stats in a very precise manner is unlikely to be fruitful.

Also, it looks as if TransportClusterStatsAction.nodeOperation only summarizes started shards, so I wonder if the reserved bytes will ever be anything but 0 here? Am I missing something (would not be the first time 🙂)?

I think I am in favor of either removing this from cluster stats (if we think it is always 0) or changing StoreStats.add to treat unknown as 0 (or, ideally, doing everything we can to figure out the expected size, like looking at the primary size, but that is a bigger ask and something for another day).

@DaveCTurner (Contributor, Author):

With this change the cluster-wide reserved size will indeed be zero as long as all nodes in the cluster are sufficiently new. It may be -1b in clusters with a mix of versions since older versions do not know how to supply this value.

With an upcoming change, however, it will be positive for some active shards (namely, searchable snapshots that are still warming up). It shouldn't be unknown for active shards on sufficiently new versions indeed, but I would still rather distinguish "nothing is reserved" from "unknown" in case this is helpful for some future analysis of a diagnostics bundle.

@henningandersen (Contributor):

I think of that statement in the opposite way: I could be unlucky enough to get a diagnostics bundle with -1 in it and then know nothing about the reserved space at the node-summary level.

Also, in the mixed-cluster case I would prefer to assume 0 from old versions. Otherwise the summarized value would stay -1 until all data nodes are upgraded, which again hides the actual number.

@DaveCTurner (Contributor, Author):

Ok, I pushed 6cee87b with your idea to treat unknown as 0 in the aggregated stats, but still report -1 in the shard-level stats.
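
A minimal, self-contained sketch of that aggregation rule, assuming a sentinel value of -1 for "unknown"; the class and field names here are invented for illustration, while the actual change is in StoreStats.add.

    // Illustrative sketch; class and field names are invented, not the actual StoreStats code.
    public class ReservedSizeSketch {
        static final long UNKNOWN_RESERVED_BYTES = -1L; // e.g. reported by a node on an older version

        private long reservedBytes; // -1 means "unknown" at the shard level

        ReservedSizeSketch(long reservedBytes) {
            this.reservedBytes = reservedBytes;
        }

        // Aggregation treats "unknown" as 0, so a single shard or old node cannot hide the
        // summarized value, while the shard-level stat keeps reporting -1 for "unknown".
        void add(ReservedSizeSketch other) {
            if (other.reservedBytes != UNKNOWN_RESERVED_BYTES) {
                reservedBytes = Math.max(0L, reservedBytes) + other.reservedBytes;
            }
        }

        long getReservedBytes() {
            return reservedBytes;
        }
    }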

@henningandersen (Contributor):

I think we can then leave out the -1 notice here? It may still need to be on the node stats due to the shard-level info, but AFAICS it can never be -1 at the cluster-stats level.

Review comment on docs/reference/cluster/nodes-stats.asciidoc (resolved)
@henningandersen (Contributor) left a review:

LGTM.

@DaveCTurner merged commit 83d6589 into elastic:master on Jul 1, 2020
@DaveCTurner deleted the 2020-06-12-reserve-bytes-in-store-stats branch on Jul 1, 2020 at 07:04
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Jul 1, 2020

Backport of elastic#58029 to 7.x
DaveCTurner added a commit that referenced this pull request Jul 1, 2020

Backport of #58029 to 7.x

* Picky picky

* Missing type
DaveCTurner added a commit that referenced this pull request Jul 1, 2020
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Jul 1, 2020
russcam added a commit to elastic/elasticsearch-net that referenced this pull request Aug 4, 2020
russcam added a commit to elastic/elasticsearch-net that referenced this pull request Aug 5, 2020
github-actions bot pushed a commit to elastic/elasticsearch-net that referenced this pull request Aug 5, 2020
github-actions bot pushed a commit to elastic/elasticsearch-net that referenced this pull request Aug 5, 2020
russcam added a commit to elastic/elasticsearch-net that referenced this pull request Aug 5, 2020
russcam added a commit to elastic/elasticsearch-net that referenced this pull request Aug 5, 2020
Labels
:Distributed Coordination/Allocation (all issues relating to the decision making around placing a shard, both master logic and on the nodes), >enhancement, Team:Distributed (meta label for distributed team, obsolete), v7.9.0, v8.0.0-alpha1