Account for remaining recovery in disk allocator #58029
Conversation
Today the disk-based shard allocator accounts for incoming shards by subtracting the estimated size of the incoming shard from the free space on the node. This is an overly conservative estimate if the incoming shard has almost finished its recovery since in that case it is already consuming most of the disk space it needs. This change adds to the shard stats a measure of how much larger each store is expected to grow, computed from the ongoing recovery, and uses this to account for the disk usage of incoming shards more accurately.
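To make the accounting concrete, here is a minimal sketch of the idea (the names are illustrative, not the PR's actual API): the space still to be reserved for an incoming shard is its expected total size minus the bytes it has already copied to disk.

```java
// Illustrative sketch only: the PR's real implementation surfaces this via
// the shard stats and ClusterInfo rather than a free-standing helper.
final class ReservedSpaceEstimate {
    // Space an in-flight recovery still needs: an almost-finished recovery
    // already occupies most of its footprint, so only the remainder counts.
    static long remainingBytes(long expectedShardSizeBytes, long bytesAlreadyRecovered) {
        return Math.max(0L, expectedShardSizeBytes - bytesAlreadyRecovered);
    }
}
```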
Pinging @elastic/es-distributed (:Distributed/Allocation)
Although I ran the
OK, it was a lot of noise from relatively few broken tests; this is good to go now.
I've left some minor comments; overall this is looking good though.
server/src/main/java/org/elasticsearch/index/shard/IndexShard.java (review comments resolved)
I did an initial review and left some comments.
server/src/main/java/org/elasticsearch/cluster/InternalClusterInfoService.java (review comments resolved)
```diff
@@ -140,6 +140,10 @@ public void onNewInfo(ClusterInfo info) {
         final DiskUsage usage = entry.value;
         final RoutingNode routingNode = routingNodes.node(node);
+
+        final long reservedSpace = info.getReservedSpace(usage.getNodeId(), usage.getPath()).getTotal();
+        final DiskUsage usageWithReservedSpace = new DiskUsage(usage.getNodeId(), usage.getNodeName(), usage.getPath(),
```
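The quoted diff is cut off above. As a hedged sketch of how the truncated constructor call would continue (the argument list is an assumption based on DiskUsage's (nodeId, nodeName, path, total, free) shape, not a verbatim quote of the PR), the compensated usage folds the reserved space into the free-bytes figure before the watermark checks run:

```java
// Hedged completion of the truncated line above, not the verbatim change:
// fold the reserved space into the free-bytes figure so the watermark
// checks see the disk as it will look once in-flight recoveries finish.
final DiskUsage usageWithReservedSpace = new DiskUsage(
    usage.getNodeId(), usage.getNodeName(), usage.getPath(),
    usage.getTotalBytes(),
    Math.max(0L, usage.getFreeBytes() - reservedSpace));
```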
AFAICS this compensation affects two functionalities:
1. Triggering a reroute when going under the low threshold.
2. Auto-releasing indices when disk usage goes below the high threshold.
I think 1) is fine, but I am a bit concerned about 2), since the primary reason we only release when under the high threshold is to have some hysteresis and avoid toggling the flood-stage block on and off rapidly. I think I would prefer to do the auto-release based on the non-compensated number (just as we move to flood-stage based on that), but I'm curious about your thoughts on this.
Yes, this is a good point. I'll work on addressing that.
I'm no longer sure about this. The compensated disk usage is never less than the non-compensated one, so auto-releasing based on the compensated number has more hysteresis than today's behaviour. Is the additional hysteresis the thing that concerns you @henningandersen or has one of us got our logic backwards?
Yes, that is my concern. Something cured the situation (a deleted index, moved shards, etc.), but due to an otherwise unrelated recovery we would keep halting indexing until that recovery is done. I would prefer to resume indexing (i.e., release the block) and then perhaps later find out that we need to halt it again. If the cluster is truly full we will get to that point, but if it has enough space we are likely to cure it before hitting the flood stage again anyway.
I think the additional hysteresis is unnecessary. AFAIK we have not seen the block flapping too frequently, and using the non-compensated number would simply keep that behavior unmodified by this PR.
I see, right. I don't think the additional hysteresis would be much of a problem in practice (if you hit the flood-stage watermark you have already lost), and continuing to recover onto a node that has breached the flood-stage watermark is likely the real bug here. However, it turned out not to add too much complexity, so I adjusted this in d8afaf3.
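For readers following along, a hedged sketch of what the adjustment in d8afaf3 amounts to (the method and parameter shapes are assumptions for illustration, not the verbatim monitor code): the flood-stage block is released based on the raw usage, while reroute decisions use the compensated figure.

```java
import java.util.Set;
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch of the split discussed above; not the verbatim DiskThresholdMonitor.
void checkNode(DiskUsage usage, DiskUsage usageWithReservedSpace,
               long freeBytesThresholdHigh, long freeBytesThresholdLow,
               Set<String> nodesToAutoRelease, AtomicBoolean reroute) {
    // Auto-release the flood-stage block based on the raw (non-compensated)
    // usage, so a cured disk resumes indexing even while an otherwise
    // unrelated recovery is still inbound to the node.
    if (usage.getFreeBytes() > freeBytesThresholdHigh) {
        nodesToAutoRelease.add(usage.getNodeId());
    }
    // Trigger reroutes based on the compensated usage, which anticipates
    // the space that in-flight recoveries will still consume.
    if (usageWithReservedSpace.getFreeBytes() < freeBytesThresholdLow) {
        reroute.set(true);
    }
}
```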
server/src/main/java/org/elasticsearch/index/shard/StoreRecovery.java (review comments resolved)
...src/main/java/org/elasticsearch/cluster/routing/allocation/decider/DiskThresholdDecider.java (review comments resolved)
This looks good. I have just two comments/questions on exposing the unknown size (-1) in the API.
(<<byte-units,byte value>>)
A prediction of how much larger the shard stores will eventually grow due to
ongoing peer recoveries, restoring snapshots, and similar activities. A value
of `-1b` indicates that this is not available.
I am slightly torn on the -1b value here. It only takes one shard with an unknown reserved size to hide the summarized value, and mostly this is just bad timing anyway: the information we have is not a consistent snapshot across the cluster, so interpreting correlations between different parts of the stats in a very precise manner is unlikely to be fruitful.
Also, it looks as if TransportClusterStatsAction.nodeOperation only summarizes started shards, so I wonder if the reserved bytes will ever be anything but 0 here? Am I missing something (would not be the first time 🙂)?
I think I am in favor of either removing this from cluster stats (if we think it is always 0) or changing StoreStats.add to treat unknown as 0 (or, ideally, doing everything we can to figure out the expected size, like looking at the primary size, but that is a bigger ask and something for another day).
With this change the cluster-wide reserved size will indeed be zero as long as all nodes in the cluster are sufficiently new. It may be -1b in clusters with a mix of versions, since older versions do not know how to supply this value.
With an upcoming change, however, it will be positive for some active shards (namely, searchable snapshots that are still warming up). It shouldn't be unknown for active shards on sufficiently new versions, but I would still rather distinguish "nothing is reserved" from "unknown" in case this is helpful for some future analysis of a diagnostics bundle.
I think of that statement in the opposite way: I could be unlucky enough to get a diagnostics bundle with -1 in it and then know nothing about the reserved space at the node-summary level.
Also, in the mixed-cluster case I would prefer to assume 0 from old versions. The summarized value would otherwise stay -1 until all data nodes are upgraded, which again hides the actual number.
OK, I pushed 6cee87b with your idea to treat unknown as 0 in the aggregated stats, but still report -1 in the shard-level stats.
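A small sketch of the agreed aggregation semantics (the constant and method names here are illustrative assumptions; the real change lives in StoreStats): shard-level values may be -1 (unknown), but sums treat unknown as 0 so a single unknown shard cannot hide the total.

```java
// Illustrative sketch of the agreed semantics, not the verbatim StoreStats code.
static final long UNKNOWN_RESERVED_BYTES = -1L;

// Shard-level stats may report -1 (unknown, e.g. when reported by an older
// node); when summing into node- or cluster-level stats, treat unknown as 0
// so one unknown shard does not turn the whole aggregate into -1.
static long addReservedBytes(long total, long shardReservedBytes) {
    return shardReservedBytes == UNKNOWN_RESERVED_BYTES ? total : total + shardReservedBytes;
}
```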
I think we can then leave out the -1 notice here? It may still be needed in the node stats due to the shard-level info, but AFAICS it can never be -1 at the cluster-stats level.
LGTM.
Backport of #58029 to 7.x
Relates: elastic/elasticsearch#58029 Co-authored-by: Russ Cam <russ.cam@elastic.co>