storage: automatic ballast files #66493
Previously I've explored solutions that avoided adding complexity to the KV layer and instead rejected queries at the SQL level to stave off disk exhaustion. Blocking at the SQL level either:
@tbg and I brainstormed over Slack and came up with a rough proposal.

Individual store

As a store's disk space reaches various thresholds, it takes increasingly aggressive corrective action.
The above plan ensures that no node ever exits without any available headroom for manual recovery. The graceful shutdown cleanly handles a single node with runaway disk space usage from a Pebble bug, or any other issue local to the node. It's also sufficient for the immediate goal of inhibiting pages in CockroachCloud, so long as SRE tooling can detect that a node gracefully shut down.

Underprovisioning

The above plan never rejects writes, so it does not gracefully handle underprovisioning. If there is not sufficient disk space within a replication zone, the nodes within the zone will trigger the final threshold outlined above. If the stores within a replication zone are underprovisioned, rejecting writes is the only way to avoid nodes exiting/crashing. Rejecting writes due to underprovisioning can happen as a follow-on to the above.

We don't want to reduce availability artificially. We need to preserve the property that a range with a healthy quorum continues to accept writes, regardless of some subset of nodes' disk space exhaustion. The decision to reject writes is local to the range and depends on its replicas' stores' capacity. If a quorum of the range's replicas are resident on stores that are >98% full, the range begins rejecting writes. This could happen through a similar mechanism as #33007. Disk space exhaustion of replicas' stores would become another trigger for range unavailability. Only KV requests that are annotated as essential or read-only would be permitted.

Ballasts

To recreate the ballasts, CockroachDB can periodically monitor the available capacity metric. When available capacity is twice the ballast size, it can attempt to recreate the ballast. The node isn't considered to have recovered from the given threshold until it successfully reestablishes the corresponding ballast.
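For concreteness, a minimal Go sketch of the recovery rule in that last paragraph; the function name and units are illustrative, not an actual implementation:

```go
package main

import "fmt"

// shouldRecreateBallast sketches the recovery rule above: only attempt to
// recreate a dropped ballast once available capacity is at least twice the
// ballast's size; until then the node remains in the degraded mode for that
// threshold.
func shouldRecreateBallast(availableBytes, ballastBytes int64) bool {
	return availableBytes >= 2*ballastBytes
}

func main() {
	const ballast = int64(1) << 30                       // 1 GiB ballast
	fmt.Println(shouldRecreateBallast(512<<20, ballast)) // false: only 512 MiB free
	fmt.Println(shouldRecreateBallast(3<<30, ballast))   // true: 3 GiB free
}
```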
Did you have anything specific in mind?
We'd discussed a second ballast that would not get deleted and so cause the node to crash. Has that changed here to somehow do the graceful shutdown?
Do we need to add this information or is it already available at the leaseholder?
Can you add some stuff about when a ballast gets automatically recreated -- presumably this is based on some metrics?
I've been thinking of a few ideas:
Ah, I had thought in our discussion that hitting the ENOSPC would trigger the deletion of the second ballast and then a prompt 'graceful' exit. Is there any benefit to trying to perform a coordinated, graceful node stop? The hard stop is simpler.
There's gossiped store capacities maintained in the StorePool and the store IDs of the replicas are readily available. There's some plumbing to connect them together, but it seems workable.
Added a bit to that previous comment. I've been thinking we begin attempting to recreate the ballast when available disk space is twice the ballast's size. The ability to recreate the ballast serves as the signal of recovery at that threshold. That means if we stop and restart, we determine our current threshold/mode by the existence of the ballasts.
Thanks for writing this up! It's been a few days and I think this checks out after re-reading, which is a good sign.
Why coordinated? I thought the only coordinated mechanisms are in the allocator and the range circuit breaker; when you're deciding to shut down, you do it based on local disk usage only.

I think the advantage of stopping proactively rather than just crashing with full disks is that the SREs need a clear exit code to understand that the node is shutting down as a result of running out of disk, and also to provide a clear message to the operator/customer in logs. If the SREs don't need the exit code - for example, because they can automatically mute pages when a cluster is almost full - then we don't need that. But it seems like bad UX to just crash.

We also need to prevent restarts, as many deployments will automatically restart the process if it exits, and if we dropped the ballast file we would then fill up until we're truly wedged.

Exit codes for CRDB are here: https://github.com/cockroachdb/cockroach/blob/bf4c97d679da6e4f92243f4087c60a2626340f30/pkg/cli/exit/codes.go and ways to prevent the node from restarting are here: cockroach/pkg/base/store_spec.go Lines 460 to 468 in 96880bc
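To illustrate the "prevent restarts" point, a rough Go sketch of the general marker-file pattern (this is not CockroachDB's actual implementation; the file name and exit code are placeholders): a deliberate out-of-disk shutdown drops a marker whose presence makes every subsequent start print the message and exit with the same code, so a restart loop can't refill the freed ballast headroom.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// Illustrative exit code for "out of disk"; CockroachDB's real codes live in
// pkg/cli/exit/codes.go and may differ.
const exitDiskFull = 10

// failIfPreventedStartup checks for a marker file written before a deliberate
// out-of-disk shutdown. If present, print its operator-facing message and exit
// with the same code instead of starting the node.
func failIfPreventedStartup(storeDir string) {
	marker := filepath.Join(storeDir, "_PREVENT_STARTUP.txt") // hypothetical name
	msg, err := os.ReadFile(marker)
	if err != nil {
		return // no marker: start normally
	}
	fmt.Fprintln(os.Stderr, string(msg))
	os.Exit(exitDiskFull)
}

func main() {
	failIfPreventedStartup("/mnt/data1/cockroach") // hypothetical store path
	fmt.Println("starting node")
}
```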
It's available. Stores regularly gossip their descriptor: cockroach/pkg/kv/kvserver/store.go Lines 1970 to 1977 in fcf1eda
and that contains a bunch of information: cockroach/pkg/roachpb/metadata.proto Lines 301 to 335 in e924d91
To be precise, we only consider voting replicas (this is implicit in the mentioning of "quorum", but still) and ">98% full or indisposed for other reasons".
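A minimal Go sketch of that per-range decision, assuming the gossiped fraction-used is already plumbed to the leaseholder; the type and field names are made up for illustration:

```go
package main

import "fmt"

// replica carries just enough of a voting replica for the sketch: the fraction
// of its store's capacity in use, taken from gossiped store descriptors.
type replica struct {
	storeFractionUsed float64
}

// shouldRejectWrites sketches the rule above: if a quorum of the range's
// voting replicas sit on stores that are more than 98% full (or otherwise
// indisposed), the range stops accepting non-essential writes.
func shouldRejectWrites(voters []replica) bool {
	quorum := len(voters)/2 + 1
	full := 0
	for _, r := range voters {
		if r.storeFractionUsed > 0.98 {
			full++
		}
	}
	return full >= quorum
}

func main() {
	voters := []replica{{0.99}, {0.97}, {0.985}}
	fmt.Println(shouldRejectWrites(voters)) // true: 2 of 3 voters are >98% full
}
```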
Strawman:
Still +1 to a dedicated exit code, but that alone doesn't solve the problem of allowing SRE to inhibit pages. Aside: vertical scale in the CC console needs to somehow pull the node out of this mode, ideally because we auto-detect when enough disk space gets added to eliminate the need for the failsafe mode.
@tbg
Coordinated was poor word choice. I just meant a deliberate process exit, rather than letting whatever codepath encountered the ENOSPC panic.
I'm worried about lagging capacity metrics causing us to ENOSPC before we have an opportunity to cleanly exit. We have the ability to detect the ENOSPC at the Pebble VFS layer, which could be used to trigger a deliberate process exit. That ensures that we can always cleanly exit, even if a write burst exhausts disk space before the process realizes. We can also perform filesystem operations while handling that ENOSPC. We could delete a file, though. When manually recovering using the ballast's headroom, the operator would remove the ballast and touch the marker file.
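A rough Go sketch of the detection idea, assuming only that write errors can be classified with errors.Is; Pebble's real vfs.FS interface is much larger, so the wrapper and callback here are purely illustrative:

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"syscall"
)

// writeDetectingENOSPC routes a write through a wrapper that classifies the
// error: an ENOSPC triggers a deliberate, clean process exit instead of
// whatever panic the calling code path would otherwise produce.
func writeDetectingENOSPC(f *os.File, p []byte, onDiskFull func()) (int, error) {
	n, err := f.Write(p)
	if err != nil && errors.Is(err, syscall.ENOSPC) {
		onDiskFull()
	}
	return n, err
}

func main() {
	f, err := os.CreateTemp("", "ballast-demo")
	if err != nil {
		panic(err)
	}
	defer os.Remove(f.Name())
	_, _ = writeDetectingENOSPC(f, []byte("hello"), func() {
		fmt.Fprintln(os.Stderr, "disk full: shutting down cleanly")
		os.Exit(10) // illustrative Disk Full exit code
	})
	fmt.Println("write ok")
}
```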
This seems doable. On startup we see that the marker file is missing and check available capacity. If it's sufficiently high, we recreate the marker file and the ballast if necessary. This same codepath can initialize the marker file and ballast on version upgrade.
I think this is possible too, just through a separate codepath from the normal prom metrics exports. What about logs? Today we check
SO fun.
We'd check for available capacity continuously instead of just once at startup, right?
Makes sense. And as long as we are very careful about doing extremely minimal logging, that headroom should be enough?
Yeah, you're right.
One awkward bit here is that we don't know the node ID until we start the node, so we can't export the
Note also that
I like that in principle. But this isn't going to be 100% effective unless all filesystem I/O goes through Pebble, which it won't. Maybe the real value is in detecting the out-of-disk condition on the restart? Should be easy - there's no space! At which point the node can be decommissioned in absentia or rescheduled with more space.
I think we only strictly need the subset of filesystem I/O that is fatal on error to go through the VFS interface, but yeah, it's hard to ensure full coverage. I just realized an additional complication to the marker file: on some filesystems like ZFS, you might not even be able to remove a file since the filesystem metadata is copy-on-write. In these cases, I think you can typically truncate a file, so you could still reclaim space from the ballast.
That's reasonable. No space might actually mean little space, since it's dependent on the size of the allocation request and filesystem used. Maybe on start we exit with an error code if disk space available to the user is < 64MB?
Yep, something like that is what I thought. I don't know how to sensibly pick the headroom, but if you have an X MB ballast file it would make sense to consider a disk "full" if it has less than X MB available. This will guarantee that when you delete the ballast file, CRDB will start again.
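A minimal, Linux-only Go sketch of that startup headroom check; the store path, threshold, and exit code are placeholders, not the actual implementation:

```go
package main

import (
	"fmt"
	"os"

	"golang.org/x/sys/unix"
)

// availableBytes returns the bytes available to the current (unprivileged)
// user on the filesystem containing path; root-reserved blocks don't help
// CockroachDB, so Bavail is the relevant figure.
func availableBytes(path string) (uint64, error) {
	var st unix.Statfs_t
	if err := unix.Statfs(path, &st); err != nil {
		return 0, err
	}
	return st.Bavail * uint64(st.Bsize), nil
}

func main() {
	const ballastSize = 1 << 30 // treat "< ballast size available" as full, per the comment above
	avail, err := availableBytes("/mnt/data1/cockroach") // hypothetical store path
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	if avail < ballastSize {
		fmt.Fprintf(os.Stderr, "store is out of disk: %d bytes available\n", avail)
		os.Exit(10) // illustrative Disk Full exit code
	}
	fmt.Printf("%d bytes available\n", avail)
}
```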
Yes, all we need is a signal that some node is out of disk, to use to inhibit all other alerts (& prob notify customer)!
If we crash with a consistent error code, possibly we can use metrics from [1].

[1] https://github.com/kubernetes/kube-state-metrics/blob/master/docs/pod-metrics.md
You mean like an operator decides this, right? You don't mean a node gets decommissioned automatically?
Yep
Add an automatically created, on-by-default emergency ballast file. This new ballast defaults to the minimum of 1% total disk capacity or 1GiB. The size of the ballast may be configured via the `--store` flag with a `ballast-size` field, accepting the same value formats as the `size` field.

The ballast is automatically created when either available disk space is at least four times the ballast size, or when available disk space after creating the ballast is at least 10 GiB. Creation of the ballast happens either when the engine is opened or during the periodic Capacity calculations driven by the `kvserver.Store`.

During node start, if available disk space is less than or equal to half the ballast size, exit immediately with a new Disk Full (10) exit code.

See cockroachdb#66493.

Release note (ops change): Add an automatically created, on-by-default emergency ballast file. This new ballast defaults to the minimum of 1% total disk capacity or 1GiB. The size of the ballast may be configured via the `--store` flag with a `ballast-size` field, accepting the same value formats as the `size` field.
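A minimal Go sketch of the sizing and creation rules in that commit message; the helper names are illustrative, not the actual implementation:

```go
package main

import "fmt"

const gib = int64(1) << 30

// defaultBallastSize mirrors the default above: the minimum of 1% of the
// store's total disk capacity and 1 GiB.
func defaultBallastSize(totalDiskBytes int64) int64 {
	size := totalDiskBytes / 100
	if size > gib {
		size = gib
	}
	return size
}

// shouldCreateBallast mirrors the creation rule above: create the ballast when
// available space is at least four times the ballast size, or when at least
// 10 GiB would remain available after creating it.
func shouldCreateBallast(availableBytes, ballastBytes int64) bool {
	return availableBytes >= 4*ballastBytes || availableBytes-ballastBytes >= 10*gib
}

func main() {
	ballast := defaultBallastSize(500 * gib)          // 1 GiB: 1% of 500 GiB exceeds the 1 GiB cap
	fmt.Println(ballast)
	fmt.Println(shouldCreateBallast(3*gib, ballast))  // false: under 4x and under 10 GiB headroom
	fmt.Println(shouldCreateBallast(12*gib, ballast)) // true
}
```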
This would be great if it's possible. The half-dead, prom-exporting mode is likely to be unexpected for non-CC users and requires some wonkiness to emulate the required endpoint(s).
Add an automatically created, on-by-default emergency ballast file. This new ballast defaults to the minimum of 1% total disk capacity or 1GiB. The size of the ballast may be configured via the `--store` flag with a `ballast-size` field, accepting the same value formats as the `size` field.

The ballast is automatically created when either available disk space is at least four times the ballast size, or when available disk space after creating the ballast is at least 10 GiB. Creation of the ballast happens either when the engine is opened or during the periodic Capacity calculations driven by the `kvserver.Store`.

During node start, if available disk space is less than or equal to half the ballast size, exit immediately with a new Disk Full (10) exit code.

See cockroachdb#66493.

Release note (ops change): Add an automatically created, on-by-default emergency ballast file. This new ballast defaults to the minimum of 1% total disk capacity or 1GiB. The size of the ballast may be configured via the `--store` flag with a `ballast-size` field, accepting the same value formats as the `size` field.

Also, add a new Disk Full (10) exit code that indicates that the node exited because disk space on at least one store is exhausted. On node start, if any store has less than half the ballast's size bytes available, the node immediately exits with the Disk Full (10) exit code. The operator may manually remove the configured ballast (assuming they haven't already) to allow the node to start, and they can take action to remedy the disk space exhaustion. The ballast will automatically be recreated when available disk space is 4x the ballast size, or at least 10 GiB is available after the ballast is created.
Annoyingly, exit codes aren't available in
Joel says I am not crazy. I think we can make this "silence based on a consistently returned exit code" idea work. I'm a bit nervous since (i) we don't already have https://cockroachlabs.atlassian.net/browse/CC-4553, (ii) maybe it could get more complex than we expect (relatedly, I am not sure I have the time to prototype it now), and (iii) the CRDB release cycle is slow. But that's okay, and maybe we can prototype a bit later. Just articulating my feelings for the group.
66893: cli,storage: add emergency ballast r=jbowens a=jbowens

68645: keys/kvprober: introduce a range-local key for probing, use from kvprober r=tbg a=joshimhoff

This work sets the stage for extending `kvprober` to do writes, as discussed in detail with @tbg at #67112.

keys: add a range-local key for probing

This commit introduces a range-local key for probing. The key will only be used by probing components like kvprober. This means no contention with user traffic or other CRDB components. This key also provides a safe place to write to in order to test write availability. A kvprober that does writes is coming soon.

Release note: None.

kvprober: probe the range-local key dedicated to probing

Before this commit, kvprober probed the start key of a range. This worked okay, as kvprober only did reads, and contention issues leading to false-positive pages haven't happened in practice. But contention issues are possible, as there may be data located at the start key of the range. With this commit, kvprober probes the range-local key dedicated to probing. No contention issues are possible, as that key is only for probing. This key is also needed for write probes, which are coming soon.

Release note: None.

69164: Revert "backupccl: protect entire keyspan during cluster backup" r=dt a=adityamaru

This reverts commit 1b5fd4f. The commit above laid a pts record over the entire table keyspace. This did not account for two things (with the potential of there being more):

1. System tables that we do not back up could have a short GC TTL, and so incremental backups that attempt to protect from `StartTime` of the previous backup would fail.
2. Dropped tables often have a short GC TTL to clear data once they have been dropped.

This change would also attempt to protect "dropped but not gc'ed tables" even though we exclude them from the backup, and fail on pts verification. One suggested approach is to exclude all objects we do not back up by subtracting these spans from {TableDataMin, TableDataMax}. This works for system tables and dropped-but-not-gc'ed tables, but breaks for dropped-and-gc'ed tables. A pts verification would still find the leaseholder of the empty span and attempt to protect below the GC threshold. In conclusion, we need to think about the semantics a little more before we rush to protect a single key span.

Co-authored-by: Jackson Owens <jackson@cockroachlabs.com>
Co-authored-by: Josh Imhoff <josh@cockroachlabs.com>
Co-authored-by: Aditya Maru <adityamaru@gmail.com>
We pushed ballast changes but rescoped/reprioritized work on write backpressure. @jbowens, please split out these issues and close the one re: ballast.
Closing for the ballast bits, and opened #74104.
Thanks for all the work you did, @jbowens! |
A CockroachDB store may run out of disk space because the operator has insufficient monitoring, an operator is unable to respond in time, etc. A CockroachDB store that exhausts available disk space crashes the node. This is especially problematic within CockroachCloud, where Cockroach Labs SREs are responsible for the availability of CockroachDB but have no control over the customer workload.
Recovering from disk space exhaustion is tricky. Deleting data within CockroachDB requires writing tombstones to disk and writing new immutable sstables before removing old ones. Adding new nodes also requires writing to existing nodes. The current recommended solution is to reserve a limited amount of disk space in a ballast file when initializing stores. If a store exhausts available disk space, the ballast file may be manually removed to provide some headroom to process deletions.
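As an illustration of that recommendation, a minimal Linux-only Go sketch of how such a ballast might be created; the path and size are placeholders, and the important detail is that the file occupies real blocks (preallocated with fallocate) rather than being sparse:

```go
package main

import (
	"fmt"
	"os"

	"golang.org/x/sys/unix"
)

// createBallast reserves disk space by preallocating a file of the given
// size. Deleting this file later is what gives the operator headroom to
// process deletions when the disk fills up.
func createBallast(path string, size int64) error {
	f, err := os.OpenFile(path, os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		return err
	}
	defer f.Close()
	// mode 0 allocates real blocks and extends the file to `size` bytes.
	if err := unix.Fallocate(int(f.Fd()), 0, 0, size); err != nil {
		return err
	}
	return f.Sync()
}

func main() {
	// Hypothetical path and size; the recommended setup reserves the space
	// inside the store directory when the store is initialized.
	if err := createBallast("/mnt/data1/cockroach/ballast", 1<<30); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```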
Within the CockroachCloud environment, customers cannot access nodes' filesystems to manage ballasts, and Cockroach Labs SRE alerting cannot differentiate between unavailability due to customer-induced disk space exhaustion and other issues. Additionally, customer support has observed some on-premises customers forget to set up ballasts, accidentally risking data loss.
In CockroachDB 21.1, running out of disk induces a panic loop on the affected node. The loss of one node triggers up-replication and precipitates additional out-of-disk nodes.
Background
In the absence of replica placement constraints, CockroachDB will rebalance to avoid exhausting a store's available disk space, as allotted to the store through the `--store` flag's `size` field. When a store is 92.5% full, the allocator prevents rebalancing to the store. When a store is 95% full, the allocator actively tries to move replicas away from the store.

Previous discussion: #8473, https://groups.google.com/a/cockroachlabs.com/g/storage/c/E-_x0EvcoaY/m/sz1fE8OBAgAJ, https://docs.google.com/document/d/1yAN9aiXuhuXKMnlWFyZr4UOHzdE1xOW3vOHKFgr7gwE/edit?usp=sharing