Account for remaining recovery in disk allocator #58029
Conversation
Today the disk-based shard allocator accounts for incoming shards by subtracting the estimated size of the incoming shard from the free space on the node. This is an overly conservative estimate if the incoming shard has almost finished its recovery since in that case it is already consuming most of the disk space it needs. This change adds to the shard stats a measure of how much larger each store is expected to grow, computed from the ongoing recovery, and uses this to account for the disk usage of incoming shards more accurately.
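To make the accounting concrete, here is a minimal sketch of the idea (the names are illustrative, not the PR's actual API): the space still to be reserved for an incoming shard is its expected total size minus the bytes it has already copied to disk.

```java
// Illustrative sketch only: the PR's real implementation surfaces this via
// the shard stats and ClusterInfo rather than a free-standing helper.
final class ReservedSpaceEstimate {
    // Space an in-flight recovery still needs: an almost-finished recovery
    // already occupies most of its footprint, so only the remainder counts.
    static long remainingBytes(long expectedShardSizeBytes, long bytesAlreadyRecovered) {
        return Math.max(0L, expectedShardSizeBytes - bytesAlreadyRecovered);
    }
}
```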
Pinging @elastic/es-distributed (:Distributed/Allocation)
Although I ran the
OK, it was a lot of noise from relatively few broken tests; this is good to go now.
I've left some minor comments; overall this is looking good though.
server/src/main/java/org/elasticsearch/index/shard/IndexShard.java (review comments resolved)
I did an initial review and left some comments.
server/src/main/java/org/elasticsearch/cluster/InternalClusterInfoService.java (review comments resolved)
```diff
@@ -140,6 +140,10 @@ public void onNewInfo(ClusterInfo info) {
         final DiskUsage usage = entry.value;
         final RoutingNode routingNode = routingNodes.node(node);
+
+        final long reservedSpace = info.getReservedSpace(usage.getNodeId(), usage.getPath()).getTotal();
+        final DiskUsage usageWithReservedSpace = new DiskUsage(usage.getNodeId(), usage.getNodeName(), usage.getPath(),
```
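The quoted diff is cut off above. As a hedged sketch of how the truncated constructor call would continue (the argument list is an assumption based on DiskUsage's (nodeId, nodeName, path, total, free) shape, not a verbatim quote of the PR), the compensated usage folds the reserved space into the free-bytes figure before the watermark checks run:

```java
// Hedged completion of the truncated line above, not the verbatim change:
// fold the reserved space into the free-bytes figure so the watermark
// checks see the disk as it will look once in-flight recoveries finish.
final DiskUsage usageWithReservedSpace = new DiskUsage(
    usage.getNodeId(), usage.getNodeName(), usage.getPath(),
    usage.getTotalBytes(),
    Math.max(0L, usage.getFreeBytes() - reservedSpace));
```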
AFAICS this compensation affects two functionalities:
1. Triggering a reroute when going under the low threshold.
2. Auto-releasing indices when disk usage goes below the high threshold.
I think 1) is fine, but I am a bit concerned about 2), since the primary reason we only release when under the high threshold is to have some hysteresis and avoid toggling the flood-stage block on and off rapidly. I think I would prefer to do the auto-release based on the non-compensated number (just as we move to flood-stage based on that), but I'm curious about your thoughts on this.
Yes, this is a good point. I'll work on addressing that.
I'm no longer sure about this. The compensated disk usage is never less than the non-compensated one, so auto-releasing based on the compensated number has more hysteresis than today's behaviour. Is the additional hysteresis the thing that concerns you @henningandersen or has one of us got our logic backwards?
Yes, that is my concern. Something cured the situation (a deleted index, moved shards, etc.), but due to an otherwise unrelated recovery we would keep halting indexing until that recovery is done. I would prefer to resume indexing (i.e., release the block) and then perhaps later find out that we need to halt it again. If the cluster is truly full we will get to that point, but if it has enough space we are likely to cure it before hitting the flood stage again anyway.
I think the additional hysteresis is unnecessary. AFAIK we have not seen the block flapping too frequently, and using the non-compensated number would simply keep that behavior unmodified by this PR.
I see, right. I don't think the additional hysteresis would be much of a problem in practice (if you hit the flood-stage watermark you have already lost), and continuing to recover onto a node that has breached the flood-stage watermark is likely the real bug here. However, it turned out not to add too much complexity, so I adjusted this in d8afaf3.
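For readers following along, a hedged sketch of what the adjustment in d8afaf3 amounts to (the method and parameter shapes are assumptions for illustration, not the verbatim monitor code): the flood-stage block is released based on the raw usage, while reroute decisions use the compensated figure.

```java
import java.util.Set;
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch of the split discussed above; not the verbatim DiskThresholdMonitor.
void checkNode(DiskUsage usage, DiskUsage usageWithReservedSpace,
               long freeBytesThresholdHigh, long freeBytesThresholdLow,
               Set<String> nodesToAutoRelease, AtomicBoolean reroute) {
    // Auto-release the flood-stage block based on the raw (non-compensated)
    // usage, so a cured disk resumes indexing even while an otherwise
    // unrelated recovery is still inbound to the node.
    if (usage.getFreeBytes() > freeBytesThresholdHigh) {
        nodesToAutoRelease.add(usage.getNodeId());
    }
    // Trigger reroutes based on the compensated usage, which anticipates
    // the space that in-flight recoveries will still consume.
    if (usageWithReservedSpace.getFreeBytes() < freeBytesThresholdLow) {
        reroute.set(true);
    }
}
```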
server/src/main/java/org/elasticsearch/index/shard/StoreRecovery.java (review comments resolved)
...src/main/java/org/elasticsearch/cluster/routing/allocation/decider/DiskThresholdDecider.java (review comments resolved)
This looks good. I have just two comments/questions on exposing the unknown size (-1) in the API.
(<<byte-units,byte value>>)
A prediction of how much larger the shard stores will eventually grow due to
ongoing peer recoveries, restoring snapshots, and similar activities. A value
of `-1b` indicates that this is not available.
I am slightly torn on the -1b value here. It only takes one shard with an unknown reserved size to hide the summarized value, and mostly this is just bad timing anyway: the information we have is not a consistent snapshot across the cluster, so interpreting correlations between different parts of the stats in a very precise manner is unlikely to be fruitful.
Also, it looks as if TransportClusterStatsAction.nodeOperation only summarizes started shards, so I wonder if the reserved bytes will ever be anything but 0 here? Am I missing something (would not be the first time 🙂)?
I think I am in favor of either removing this from cluster stats (if we think it is always 0) or changing StoreStats.add to treat unknown as 0 (or, ideally, doing everything we can to figure out the expected size, like looking at the primary size, but that is a bigger ask and something for another day).
With this change the cluster-wide reserved size will indeed be zero as long as all nodes in the cluster are sufficiently new. It may be -1b in clusters with a mix of versions, since older versions do not know how to supply this value.
With an upcoming change, however, it will be positive for some active shards (namely, searchable snapshots that are still warming up). It shouldn't be unknown for active shards on sufficiently new versions, but I would still rather distinguish "nothing is reserved" from "unknown" in case this is helpful for some future analysis of a diagnostics bundle.
I think of that statement in the opposite way: I could be unlucky enough to get a diagnostics bundle with -1 in it and then know nothing about the reserved space at the node-summary level.
Also, in the mixed-cluster case I would prefer to assume 0 from old versions. The summarized value would otherwise stay -1 until all data nodes are upgraded, which again hides the actual number.
OK, I pushed 6cee87b with your idea to treat unknown as 0 in the aggregated stats, but still report -1 in the shard-level stats.
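A small sketch of the agreed aggregation semantics (the constant and method names here are illustrative assumptions; the real change lives in StoreStats): shard-level values may be -1 (unknown), but sums treat unknown as 0 so a single unknown shard cannot hide the total.

```java
// Illustrative sketch of the agreed semantics, not the verbatim StoreStats code.
static final long UNKNOWN_RESERVED_BYTES = -1L;

// Shard-level stats may report -1 (unknown, e.g. when reported by an older
// node); when summing into node- or cluster-level stats, treat unknown as 0
// so one unknown shard does not turn the whole aggregate into -1.
static long addReservedBytes(long total, long shardReservedBytes) {
    return shardReservedBytes == UNKNOWN_RESERVED_BYTES ? total : total + shardReservedBytes;
}
```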
I think we can then leave out the -1 notice here? It may still be needed in the node stats due to the shard-level info, but AFAICS it can never be -1 at the cluster-stats level.
LGTM.
Backport of #58029 to 7.x
Relates: elastic/elasticsearch#58029 Co-authored-by: Russ Cam <russ.cam@elastic.co>