storage: automatic ballast files #66493

Closed
jbowens opened this issue Jun 15, 2021 · 20 comments

@jbowens
Collaborator

jbowens commented Jun 15, 2021

A CockroachDB store may run out of disk space because the operator has insufficient monitoring, an operator is unable to respond in time, etc. A CockroachDB store that exhausts available disk space crashes the node. This is especially problematic within CockroachCloud, where Cockroach Labs SREs are responsible for the availability of CockroachDB but have no control over the customer workload.

Recovering from disk space exhaustion is tricky. Deleting data within CockroachDB requires writing tombstones to disk and writing new immutable sstables before removing old ones. Adding new nodes also requires writing to existing nodes. The current recommended solution is to reserve a limited amount of disk space in a ballast file when initializing stores. If a store exhausts available disk space, the ballast file may be manually removed to provide some headroom to process deletions.

Within the CockroachCloud environment, customers cannot access nodes' filesystems to manage ballasts, and Cockroach Labs SRE alerting cannot differentiate between unavailability due to customer-induced disk space exhaustion and other issues. Additionally, customer support has observed some on-premise customers forget to set up ballasts, accidentally risking data loss.

In CockroachDB 21.1, running out of disk induces a panic loop on the affected node. The loss of one node triggers up-replication, which precipitates additional out-of-disk nodes.

Background
In the absence of replica placement constraints, CockroachDB will rebalance to avoid exhausting a store's available disk space, as allotted to the store through the --store flag's size field. When a store is 92.5% full, the allocator prevents rebalancing to the store. When a store is 95% full, the allocator actively tries to move replicas away from the store.

Previous discussion: #8473, https://groups.google.com/a/cockroachlabs.com/g/storage/c/E-_x0EvcoaY/m/sz1fE8OBAgAJ, https://docs.google.com/document/d/1yAN9aiXuhuXKMnlWFyZr4UOHzdE1xOW3vOHKFgr7gwE/edit?usp=sharing

@jbowens jbowens added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-storage Storage Team labels Jun 15, 2021
@jbowens jbowens self-assigned this Jun 15, 2021
@jbowens
Collaborator Author

jbowens commented Jun 15, 2021

Previously I explored solutions that avoided adding complexity to the KV layer and instead rejected queries at the SQL level to stave off disk exhaustion. Blocking at the SQL level either:

  • doesn't guarantee you'll avoid exhaustion, because queries may arrive via other nodes that are not out of disk, or
  • unacceptably reduces availability. CockroachDB zone configurations dictate where data may be replicated, and disk space exhaustion is a property of a replication zone, not a cluster.

@tbg and I brainstormed over Slack and came up with a rough proposal.

Individual store

As a store's disk space reaches various thresholds, it takes increasingly aggressive corrective action.

  • When a store is 92.5% full the allocator stops moving ranges to the store. This is current behavior.
  • When a store is 95% full the allocator starts moving data off the store with high priority. This is current behavior. If replica placement constraints allow a replica to be relocated to one of the remaining stores with <92.5% disk usage, the store will shed replicas. We may want to additionally adjust Pebble options to be more aggressive about reclaiming disk space, and be more restrictive about the type of allowed background operations that may pin sstables.
  • When a store is just about full (>99.5%?, <1 GiB available?), the store's node gracefully shuts down. Shutting down before total exhaustion allows an operator to manually recover. Graceful shutdown may also be used to inhibit SRE pages in the CockroachCloud environment.

The above plan ensures that no node ever exits without any available headroom for manual recovery. The graceful shutdown cleanly handles a single node with runaway disk space usage from a Pebble bug, or any other issue local to the node. It's also sufficient for the immediate goal of inhibiting pages in CockroachCloud, so long as SRE tooling can detect that a node gracefully shut down.
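
For concreteness, a rough sketch of how a store's fullness fraction could map to these escalating actions; the type and names below are illustrative of the proposal, not actual CockroachDB code, and the final threshold is still an open question above:

```go
package main

import "fmt"

// diskAction enumerates the escalating responses described above. The names
// are illustrative, not real CockroachDB identifiers.
type diskAction int

const (
	actionNone          diskAction = iota // below all thresholds
	actionNoRebalanceTo                   // >=92.5% full: stop moving ranges to this store
	actionShedReplicas                    // >=95% full: move replicas away with high priority
	actionGracefulStop                    // >=99.5% full: shut the node down before total exhaustion
)

// actionForFullness maps a store's fullness fraction to the corrective action
// sketched in the proposal.
func actionForFullness(fraction float64) diskAction {
	switch {
	case fraction >= 0.995:
		return actionGracefulStop
	case fraction >= 0.95:
		return actionShedReplicas
	case fraction >= 0.925:
		return actionNoRebalanceTo
	default:
		return actionNone
	}
}

func main() {
	for _, f := range []float64{0.90, 0.93, 0.96, 0.999} {
		fmt.Printf("%.3f full -> action %d\n", f, actionForFullness(f))
	}
}
```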

Underprovisioning

The above plan never rejects writes, so it does not gracefully handle underprovisioning. If there is insufficient disk space within a replication zone, the nodes within the zone will eventually hit the final threshold outlined above; if the stores within a replication zone are underprovisioned, rejecting writes is the only way to avoid nodes exiting or crashing. Rejecting writes due to underprovisioning can happen as a follow-on to the above.

We don't want to reduce availability artificially. We need to preserve the property that a range with a healthy quorum continues to accept writes, regardless of some subset of nodes' disk space exhaustion. The decision to reject writes is local to the range, and is dependent on its replicas' stores' capacity. If a quorum of the range's replicas are resident on stores that are >98% full, the range begins rejecting writes. This could happen through a similar mechanism as #33007. Disk space exhaustion of replicas' stores would become another trigger for range unavailability. Only KV requests that are annotated as essential or read-only would be permitted.
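
A minimal sketch of the quorum check described above, assuming the fullness fractions of the voting replicas' stores are already in hand (in CockroachDB they would come from the gossiped store descriptors in the StorePool); the function name and signature are illustrative:

```go
package main

import "fmt"

// quorumOnFullStores reports whether at least a quorum of a range's voting
// replicas live on stores above the given fullness threshold (0.98 in the
// proposal above).
func quorumOnFullStores(voterStoreFullness []float64, threshold float64) bool {
	full := 0
	for _, f := range voterStoreFullness {
		if f > threshold {
			full++
		}
	}
	quorum := len(voterStoreFullness)/2 + 1
	return full >= quorum
}

func main() {
	// Two of three voters sit on >98% full stores, so the range would begin
	// rejecting non-essential writes.
	fmt.Println(quorumOnFullStores([]float64{0.99, 0.985, 0.60}, 0.98)) // true
}
```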

Ballasts
Today the allocator relies on the capacity metrics that are reported by the engine. It'll likely be more reliable to use ballasts that are transparently removed at the storage layer (see cockroachdb/pebble#1164) to trigger some of these thresholds. That ensures lagging metrics don't negate these protections.

To recreate the ballasts, CockroachDB can periodically monitor the available capacity metric. When available capacity is twice the ballast size, it can attempt to recreate the ballast. The node isn't considered to have recovered from the given threshold until it successfully reestablishes the corresponding ballast.

@sumeerbhola
Collaborator

We may want to additionally adjust Pebble options to be more aggressive about reclaiming disk space, and be more restrictive about the type of allowed background operations that may pin sstables.

Did you have anything specific in mind?

When a store is just about full (>99.5%?, <1gb available?), the store's node gracefully shuts down.

We'd discussed a second ballast that would not get deleted and so cause the node to crash. Has that changed here to somehow do the graceful shut down?

If a quorum of the range's replicas are resident on stores that are >98% full, the range begins rejecting writes.

Do we need to add this information or is it already available at the leaseholder?

It'll likely be more reliable to use ballasts that are transparently removed at the storage layer

Can you add some stuff about when a ballast gets automatically recreated -- presumably this is based on some metrics?

@jbowens
Collaborator Author

jbowens commented Jun 16, 2021

Did you have anything specific in mind?

I've been thinking of a few ideas:

  • reducing compaction concurrency would help keep the in-flight compaction bloat down, which could help make the disk usage numbers more stable
  • reducing the maxExpandedBytes used by the compaction picker would also help keep in-flight compaction bloat down
  • increasing the DeleteRangeFlushDelay would reduce the number of flushes from range deletions and make deletion-only compactions more common since more range deletions can be flushed together, yielding wider merged range deletions.
  • reducing the threshold for elision-only compactions
  • increasing the weight assigned to range and point deletion estimates in compensated sizes

We'd discussed a second ballast that would not get deleted and so cause the node to crash. Has that changed here to somehow do the graceful shut down?

Ah, I had thought in our discussion that hitting the ENOSPC would trigger the deletion of the second ballast and then a prompt 'graceful' exit. Is there any benefit to trying to perform a coordinated, graceful node stop? The hard stop is simpler.

Do we need to add this information or is it already available at the leaseholder?

There's gossiped store capacities maintained in the StorePool and the store IDs of the replicas are readily available. There's some plumbing to connect them together, but it seems workable.

Can you add some stuff about when a ballast gets automatically recreated -- presumably this is based on some metrics?

Added a bit to that previous comment. I've been thinking we begin attempting to recreate the ballast when there's available disk space twice the ballast's size. The ability to recreate the ballast serves as the signal of recovery at that threshold. That means if we stop and restart, we determine our current threshold/mode by the existence of the ballasts.
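
A small sketch of that recovery rule, with an illustrative path and a hypothetical helper name: the ballast's existence marks the mode, and recreation is only attempted once available space reaches twice the ballast size.

```go
package main

import (
	"fmt"
	"os"
)

// shouldRecreateBallast reports whether the ballast exists and, if it is
// missing, whether we should attempt to recreate it: only once available disk
// space is at least twice the ballast's size. Any stat error is treated as
// "missing" for the purposes of this sketch.
func shouldRecreateBallast(ballastPath string, availableBytes, ballastSize int64) (exists, recreate bool) {
	if _, err := os.Stat(ballastPath); err == nil {
		return true, false // ballast present: not in a degraded mode
	}
	return false, availableBytes >= 2*ballastSize
}

func main() {
	exists, recreate := shouldRecreateBallast("/mnt/data1/auxiliary/EMERGENCY_BALLAST", 4<<30, 1<<30)
	fmt.Println(exists, recreate)
}
```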

@tbg
Member

tbg commented Jun 17, 2021

Thanks for writing this up! It's been a few days and I think this checks out after re-reading, which is a good sign.

Ah, I had thought in our discussion that hitting the ENOSPC would trigger the deletion of the second ballast and then a prompt 'graceful' exit. Is there any benefit to trying to perform a coordinated, graceful node stop? The hard stop is simpler.

Why coordinated? I thought the only coordinated mechanisms are in the allocator and the range circuit breaker; when you're deciding to shut down, you do it based on local disk usage only. I think the advantage of stopping proactively rather than just crashing with full disks is that the SREs need a clear exit code to understand that the node is shutting down as a result of running out of disk, and also to provide a clear message to the operator/customer in the logs. If the SREs don't need the exit code - for example, because they can automatically mute pages when a cluster is almost full - then we don't need that. But it seems like bad UX to just crash. We also need to prevent restarts, as many deployments will automatically restart the process if it exits, and if we dropped the ballast file we would then fill up until we're truly wedged.

Exit codes for CRDB are here: https://github.com/cockroachdb/cockroach/blob/bf4c97d679da6e4f92243f4087c60a2626340f30/pkg/cli/exit/codes.go
We can configure Pebble with a callback that then invokes os.Exit with the right exit code (or something like that).

and ways to prevent the node from restarting are here:

// PreventedStartupFile returns the path to a file which, if it exists, should
// prevent the server from starting up. Returns an empty string for in-memory
// engines.
func (ss StoreSpec) PreventedStartupFile() string {
	if ss.InMemory {
		return ""
	}
	return PreventedStartupFile(filepath.Join(ss.Path, AuxiliaryDir))
}
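
Putting those two pieces together, a rough sketch of what the deliberate shutdown path could look like; the exit code value and marker path below are illustrative stand-ins, not the real pkg/cli/exit code or StoreSpec.PreventedStartupFile output:

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"syscall"
)

// Illustrative stand-ins: the real exit code would come from pkg/cli/exit and
// the real marker path from StoreSpec.PreventedStartupFile.
const (
	exitCodeDiskFull     = 10
	preventedStartupPath = "auxiliary/_CRITICAL_ALERT.txt"
)

// maybeExitOnDiskFull is the kind of callback the storage engine could invoke
// when a filesystem write fails: if the error is ENOSPC, drop a marker file
// that blocks automatic restarts and exit deliberately with a dedicated code,
// rather than letting an arbitrary codepath crash the process.
func maybeExitOnDiskFull(err error) {
	if err == nil || !errors.Is(err, syscall.ENOSPC) {
		return
	}
	// Best effort: writing the marker may itself fail if the disk is truly full.
	_ = os.WriteFile(preventedStartupPath, []byte("disk full; remove the ballast to recover\n"), 0644)
	fmt.Fprintln(os.Stderr, "store out of disk space; shutting down")
	os.Exit(exitCodeDiskFull)
}

func main() {
	// Simulate a failed write: a *os.PathError wrapping ENOSPC is roughly what
	// a full filesystem returns.
	maybeExitOnDiskFull(&os.PathError{Op: "write", Path: "000123.sst", Err: syscall.ENOSPC})
}
```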

If a quorum of the range's replicas are resident on stores that are >98% full, the range begins rejecting writes.

Do we need to add this information or is it already available at the leaseholder?

It's available. Stores regularly gossip their descriptor:

// GossipStore broadcasts the store on the gossip network.
func (s *Store) GossipStore(ctx context.Context, useCached bool) error {
	// Temporarily indicate that we're gossiping the store capacity to avoid
	// recursively triggering a gossip of the store capacity.
	syncutil.StoreFloat64(&s.gossipQueriesPerSecondVal, -1)
	syncutil.StoreFloat64(&s.gossipWritesPerSecondVal, -1)
	storeDesc, err := s.Descriptor(ctx, useCached)

and that contains a bunch of information:

// StoreCapacity contains capacity information for a storage device.
message StoreCapacity {
  option (gogoproto.goproto_stringer) = false;
  // Total capacity of the disk used by the store, including space used by the
  // operating system and other applications.
  optional int64 capacity = 1 [(gogoproto.nullable) = false];
  // Available space remaining on the disk used by the store.
  optional int64 available = 2 [(gogoproto.nullable) = false];
  // Amount of disk space used by the data in the CockroachDB store. Note that
  // this is going to be less than (capacity - available), because those two
  // fields consider the entire disk and everything on it, while this only
  // tracks the store's disk usage.
  optional int64 used = 8 [(gogoproto.nullable) = false];
  // Amount of logical bytes stored in the store, ignoring RocksDB space
  // overhead. Useful for rebalancing so that moving a replica from one store
  // to another actually removes its bytes from the source store even though
  // RocksDB may not actually reclaim the physical disk space for a while.
  optional int64 logical_bytes = 9 [(gogoproto.nullable) = false];
  optional int32 range_count = 3 [(gogoproto.nullable) = false];
  optional int32 lease_count = 4 [(gogoproto.nullable) = false];
  // queries_per_second tracks the average number of queries processed per
  // second by replicas in the store. The stat is tracked over the time period
  // defined in storage/replica_stats.go, which as of July 2018 is 30 minutes.
  optional double queries_per_second = 10 [(gogoproto.nullable) = false];
  // writes_per_second tracks the average number of keys written per second
  // by ranges in the store. The stat is tracked over the time period defined
  // in storage/replica_stats.go, which as of July 2018 is 30 minutes.
  optional double writes_per_second = 5 [(gogoproto.nullable) = false];
  // bytes_per_replica and writes_per_replica contain percentiles for the
  // number of bytes and writes-per-second to each replica in the store.
  // This information can be used for rebalancing decisions.
  optional Percentiles bytes_per_replica = 6 [(gogoproto.nullable) = false];
  optional Percentiles writes_per_replica = 7 [(gogoproto.nullable) = false];
}

If a quorum of the range's replicas are resident on stores that are >98% full, the range begins rejecting writes.

To be precise, we only consider voting replicas (this is implicit in the mention of "quorum", but still) and ">98% full or indisposed for other reasons".

@joshimhoff
Collaborator

joshimhoff commented Jun 17, 2021

We also need to prevent restarts, as many deployments will automatically restart the process if it exits, and if we dropped the ballast file we would then fill up until we're truly wedged.

Strawman:

  1. Write something to the disk indicating you are in EOD failsafe mode (touch a file; there better be space for that lol).
  2. Crash.
  3. When you come up again (supervisors like k8s will restart the process, per Tobias), don't actually start the server if you see the EOD failsafe mode file on disk, but also don't crash.
  4. Don't actually start the server, but DO export a prom metric that indicates the node is in EOD failsafe mode; SRE can use that to inhibit alerts, even ones from sqlprober (we don't alert just on crashes).

Still +1 to dedicated exit code, but that alone doesn't solve the problem of allowing SRE to inhibit pages.

Aside: Vertical scale in the CC console needs to somehow pull the node out of this mode, ideally because we auto-detect when enough disk space has been added to eliminate the need for the failsafe mode.

@mwang1026

@tbg, regarding ">98% full or indisposed for other reasons": does this mean that there will be a general range circuit breaker, and >98% full is one trigger for the circuit breaker? Does this work then rely on there being a circuit breaker?

@jbowens
Collaborator Author

jbowens commented Jun 17, 2021

Why coordinated? I thought the only coordinated mechanisms are in the allocator and the range circuit breaker; when you're deciding to shut down you do it based on local disk usage only.

"Coordinated" was a poor word choice. I just meant a deliberate process exit, rather than letting whatever codepath encountered the ENOSPC crash the process.

We also need to prevent restarts, as many deployments will automatically restart the process if it exits, and if we dropped the ballast file we would then fill up until we're truly wedged.

Write something to the disk indicating you are in EOD failsafe mode (touch a file; there better be space for that lol).

I'm worried about lagging capacity metrics causing us to hit ENOSPC before we have an opportunity to cleanly exit. We have the ability to detect the ENOSPC at the Pebble VFS layer, which could be used to trigger a deliberate process exit. That ensures that we can always cleanly exit, even if a write burst exhausts disk space before the process realizes it. We can also perform filesystem operations while handling that ENOSPC; however, if we've exhausted space, it's entirely possible that we don't have the capacity to write a marker file indicating that we exited.

We could delete a file though. When manually recovering using the ballast's headroom, the operator would remove the ballast and touch the marker file.

Aside: Vertical scale in CC console needs to somehow pull the node out of this mode, ideally cause we auto-detect when enough disk space gets added to eliminate need for the failsafe mode.

This seems doable. On startup we see that the marker file is missing and check available capacity. If it's sufficiently high, we recreate the marker file and the ballast if necessary. This same codepath can initialize the marker file and ballast on version upgrade.

Don't start the server really but DO export a prom metric that indicates node is in EOD failsafe mode; SRE can use that to inhibit alerts, even ones from sqlprober (we don't alert just on crashes).

I think this is possible too, just through a separate codepath from the normal prom metrics exports.


What about logs? Today we check PriorCriticalAlertError after initializing logging, which requires creating files. It would also be very hard to tell what the cockroach process was doing if we entered a long-running prom-metric-exporting mode without any logging. If we automatically deleted the ballast instead of letting the operator delete it, we could have some headroom for the log file rotation.

@joshimhoff
Collaborator

We could delete a file though.

SO fun.

On startup we see that the marker file is missing and check available capacity.

We'd check for available capacity continuously instead of just once at startup, right?

What about logs? If we automatically deleted the ballast instead of letting the operator delete it, we could have some headroom for the log file rotation.

Makes sense. And as long as we are very careful about doing extremely minimal logging, that headroom should be enough?

@jbowens
Collaborator Author

jbowens commented Jun 23, 2021

We'd check for available capacity continuously instead of just once at startup, right?

Yeah, you're right.

Don't start the server really but DO export a prom metric that indicates node is in EOD failsafe mode; SRE can use that to inhibit alerts, even ones from sqlprober (we don't alert just on crashes).

One awkward bit here is that we don't know the node ID until we start the node, so we can't export the node_id gauge. Is that okay on your end? Eg, /_status/vars would literally contain a single gauge node_stopped_on_full_disk with value 1.
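
For illustration, a minimal stub of the kind of endpoint being discussed: a bare /_status/vars that serves only that single gauge. The port and handler wiring are assumptions, not actual CockroachDB code; only the metric name comes from the comment above.

```go
package main

import (
	"fmt"
	"net/http"
)

// Serve a bare-bones /_status/vars exposing only the single gauge mentioned
// above, for the mode in which the node refuses to start because its disk is
// full.
func main() {
	http.HandleFunc("/_status/vars", func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "text/plain; version=0.0.4")
		fmt.Fprintln(w, "# HELP node_stopped_on_full_disk Node is stopped because its store ran out of disk space.")
		fmt.Fprintln(w, "# TYPE node_stopped_on_full_disk gauge")
		fmt.Fprintln(w, "node_stopped_on_full_disk 1")
	})
	_ = http.ListenAndServe(":8080", nil)
}
```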

@tbg
Member

tbg commented Jun 23, 2021

Note also that /_status/vars is pending inclusion into the v2 API, so having weird one-off code paths that expose a stub endpoint might cause additional headaches. cc @thtruo
It should be possible, though. We can house a mock version of whatever endpoints are expected to be present. We should also host a health endpoint with something in it. It's easy to get sucked into too much work here. Maybe SREs can figure out a way to get what they want without the need for a separate metric, through conditions to be exposed at startup by k8s.

We could delete a file though. When manually recovering using the ballast's headroom, the operator would remove the ballast and touch the marker file.

I like that in principle. But this isn't going to be 100% effective unless all filesystem I/O goes through pebble, which it won't.

Maybe the real value is in detecting the out of disk condition on the restart? Should be easy - there's no space! At which point the node can be decommissioned in absentia or rescheduled with more space.

@jbowens
Collaborator Author

jbowens commented Jun 23, 2021

I like that in principle. But this isn't going to be 100% effective unless all filesystem I/O goes through pebble, which it won't.

I think we only strictly need the subset of filesystem I/O that is fatal on error to go through the VFS interface, but yeah, it's hard to ensure full coverage.

I just realized an additional complication to the marker file: on some filesystems like ZFS, you might not even be able to remove a file since the filesystem metadata is copy-on-write. In these cases, I think you can typically truncate a file, so you could still reclaim space from the ballast.

Maybe the real value is in detecting the out of disk condition on the restart? Should be easy - there's no space! At which point the node can be decommissioned in absentia or rescheduled with more space.

That's reasonable. No space might actually mean little space, since it's dependent on the size of the allocation request and filesystem used. Maybe on start we exit with an error code if disk space available to the user is < 64MB?

@tbg
Member

tbg commented Jun 24, 2021

That's reasonable. No space might actually mean little space, since it's dependent on the size of the allocation request and filesystem used. Maybe on start we exit with an error code if disk space available to the user is < 64MB?

Yep, something like that is what I thought. I don't know how to sensibly pick the headroom, but if you have an X MB ballast file it would make sense to consider a disk "full" if it has less than X MB available. This will guarantee that when you delete the ballast file, CRDB will start again.
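
A sketch of that startup check, assuming the ballast size is used as the headroom threshold; it uses golang.org/x/sys/unix, the store path is hypothetical, and the exit code value is illustrative:

```go
package main

import (
	"fmt"
	"os"

	"golang.org/x/sys/unix"
)

// availableBytes returns the bytes available to unprivileged users on the
// filesystem containing path.
func availableBytes(path string) (uint64, error) {
	var st unix.Statfs_t
	if err := unix.Statfs(path, &st); err != nil {
		return 0, err
	}
	return st.Bavail * uint64(st.Bsize), nil
}

func main() {
	const storePath = "/mnt/data1"      // illustrative store path
	const ballastSize = uint64(1 << 30) // 1 GiB, matching the default discussed below

	avail, err := availableBytes(storePath)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	if avail < ballastSize {
		fmt.Fprintf(os.Stderr, "store %s considered full: only %d bytes available\n", storePath, avail)
		os.Exit(10) // illustrative Disk Full exit code
	}
}
```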

@joshimhoff
Collaborator

One awkward bit here is that we don't know the node ID until we start the node, so we can't export the node_id gauge. Is that okay on your end? Eg, /_status/vars would literally contain a single gauge node_stopped_on_full_disk with value 1.

Yes, all we need is a signal that some node is out of disk, to use to inhibit all other alerts (& prob notify customer)!

Maybe SREs can figure out a way to get what they want without the need for a separate metric, through conditions to be exposed at startup by k8s.

If we crash with a consistent error code, possibly we can use metrics from kube-state-metrics [1] such as kube_pod_container_status_last_terminated_reason. I would need to do some experiments to see if this will work reliably.

[1] https://github.com/kubernetes/kube-state-metrics/blob/master/docs/pod-metrics.md

At which point the node can be decommissioned in absentia or rescheduled with more space.

You mean like an operator decides this, right? You don't mean a node gets decommissioned automatically?

@tbg
Member

tbg commented Jun 25, 2021

You mean like an operator decides this, right?

Yep

jbowens added a commit to jbowens/cockroach that referenced this issue Jun 25, 2021
Add an automatically created, on-by-default emergency ballast file. This
new ballast defaults to the minimum of 1% total disk capacity or 1GiB.
The size of the ballast may be configured via the `--store` flag with a
`ballast-size` field, accepting the same value formats as the `size`
field.

The ballast is automatically created when either available disk space is
at least four times the ballast size, or when available disk space after
creating the ballast is at least 10 GiB. Creation of the ballast happens
either when the engine is opened or during the periodic Capacity
calculations driven by the `kvserver.Store`.

During node start, if available disk space is less than or equal to half
the ballast size, exit immediately with a new Disk Full (10) exit code.

See cockroachdb#66493.

Release note (ops change): Add an automatically created, on by default
emergency ballast file. This new ballast defaults to the minimum of 1%
total disk capacity or 1GiB.  The size of the ballast may be configured
via the `--store` flag with a `ballast-size` field, accepting the same
value formats as the `size` field.
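
The sizing and creation rules from this commit message, expressed as a sketch; the function names are illustrative, not the actual implementation:

```go
package main

import "fmt"

const gib = 1 << 30

// defaultBallastSize implements the default from the commit message: the
// minimum of 1% of total disk capacity and 1 GiB.
func defaultBallastSize(totalCapacity int64) int64 {
	size := totalCapacity / 100
	if size > gib {
		size = gib
	}
	return size
}

// shouldCreateBallast mirrors the creation condition above: create the
// ballast when available disk space is at least four times the ballast size,
// or when at least 10 GiB would remain available after creating it.
func shouldCreateBallast(available, ballastSize int64) bool {
	return available >= 4*ballastSize || available-ballastSize >= 10*gib
}

// shouldExitDiskFull mirrors the start-time check: exit with the Disk Full
// exit code if available disk space is at most half the ballast size.
func shouldExitDiskFull(available, ballastSize int64) bool {
	return available <= ballastSize/2
}

func main() {
	total := int64(500 * gib)
	ballast := defaultBallastSize(total)
	fmt.Println(ballast, shouldCreateBallast(64*gib, ballast), shouldExitDiskFull(256<<20, ballast))
}
```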
@jbowens
Collaborator Author

jbowens commented Jun 25, 2021

If we crash with a consistent error code, possibly we can use metrics from kube-state-metrics [1] such as kube_pod_container_status_last_terminated_reason. I would need to do some experiments to see if this will work reliably.

This would be great if it's possible. The half-dead, prom-exporting mode is likely to be unexpected for non-CC users and requires some wonkiness to emulate the required endpoint(s).

@jbowens jbowens assigned tbg and unassigned tbg Jun 25, 2021
jbowens added a commit to jbowens/cockroach that referenced this issue Jun 28, 2021
jbowens added a commit to jbowens/cockroach that referenced this issue Jun 29, 2021
@joshimhoff
Collaborator

If we crash with a consistent error code, possibly we can use metrics from kube-state-metrics [1] such as kube_pod_container_status_last_terminated_reason. I would need to do some experiments to see if this will work reliably.

Annoyingly, exit codes aren't available in kube-state-metrics, just "exit reasons" (it always returns "error" for non-zero exit codes). I see no off-the-shelf tool to get this data into Prometheus, though maybe I'll find one with more research. Building the tool ourselves is possible and might be useful in other situations, as it provides observability with history that we don't have today (we could really open source it). I wrote a CC issue here; I am getting Joel to take a look today and tell me if I'm crazy: https://cockroachlabs.atlassian.net/browse/CC-4553

@joshimhoff
Collaborator

Joel says I am not crazy. I think we can make this silencing idea, based on consistently returning an exit code, work. I'm a bit nervous since (i) we don't already have https://cockroachlabs.atlassian.net/browse/CC-4553, (ii) maybe it could get more complex than we expect (relatedly, I am not sure I have the time to prototype it now), and (iii) the CRDB release cycle is slow. But that's okay, and maybe we can prototype a bit later. Just articulating my feelings for the group.

jbowens added a commit to jbowens/cockroach that referenced this issue Aug 13, 2021
jbowens added a commit to jbowens/cockroach that referenced this issue Aug 19, 2021
craig bot pushed a commit that referenced this issue Aug 19, 2021
66893: cli,storage: add emergency ballast  r=jbowens a=jbowens

Add an automatically created, on-by-default emergency ballast file. This
new ballast defaults to the minimum of 1% total disk capacity or 1GiB.
The size of the ballast may be configured via the `--store` flag with a
`ballast-size` field, accepting the same value formats as the `size`
field.

The ballast is automatically created when either available disk space is
at least four times the ballast size, or when available disk space after
creating the ballast is at least 10 GiB. Creation of the ballast happens
either when the engine is opened or during the periodic Capacity
calculations driven by the `kvserver.Store`.

During node start, if available disk space is less than or equal to half
the ballast size, exit immediately with a new Disk Full (10) exit code.

See #66493.

Release note (ops change): Add an automatically created, on by default
emergency ballast file. This new ballast defaults to the minimum of 1%
total disk capacity or 1GiB.  The size of the ballast may be configured
via the `--store` flag with a `ballast-size` field, accepting the same
value formats as the `size` field. Also, add a new Disk Full (10) exit
code that indicates that the node exited because disk space on at least
one store is exhausted. On node start, if any store has less than half
the ballast's size bytes available, the node immediately exits with the
Disk Full (10) exit code. The operator may manually remove the
configured ballast (assuming they haven't already) to allow the node to
start, and they can take action to remedy the disk space exhaustion. The
ballast will automatically be recreated when available disk space is 4x
the ballast size, or at least 10 GiB is available after the ballast is
created.

68645: keys/kvprober: introduce a range-local key for probing, use from kvprober r=tbg a=joshimhoff

This work sets the stage for extending `kvprober` to do writes as is discussed in detail with @tbg at #67112.

**keys: add a range-local key for probing**

This commit introduces a range-local key for probing. The key will
only be used by probing components like kvprober. This means no
contention with user-traffic or other CRDB components. This key also provides
a safe place to write to in order to test write availabilty. A kvprober that
does writes is coming soon.

Release note: None.

**kvprober: probe the range-local key dedicated to probing**

Before this commit, kvprober probed the start key of a range. This worked okay,
as kvprober only did reads, and contention issues leading to false positive
pages haven't happened in practice. But contention issues are possible,
as there may be data located at the start key of the range.

With this commit, kvprober probes the range-local key dedicated to
probing. No contention issues are possible, as that key is only for
probing. This key is also needed for write probes, which are coming soon.

Release note: None.

69164: Revert "backupccl: protect entire keyspan during cluster backup" r=dt a=adityamaru

This reverts commit 1b5fd4f.

The commit above laid a pts record over the entire table keyspace.
This did not account for two things (with the potential of there being
more):

1. System tables that we do not back up could have a short GC TTL, and
so incremental backups that attempt to protect from `StartTime` of the
previous backup would fail.

2. Dropped tables often have a short GC TTL to clear data once they have
been dropped. This change would also attempt to protect "dropped but not
gc'ed tables" even though we exclude them from the backup, and fail on
pts verification.

One suggested approach is to exclude all objects we do not back up by
subtracting these spans from {TableDataMin, TableDataMax}. This works
for system tables, and dropped but not gc'ed tables, but breaks for
dropped and gc'ed tables. A pts verification would still find the leaseholder
of the empty span and attempt to protect below the gc threshold.

In conclusion, we need to think about the semantics a little more before
we rush to protect a single key span.

Co-authored-by: Jackson Owens <jackson@cockroachlabs.com>
Co-authored-by: Josh Imhoff <josh@cockroachlabs.com>
Co-authored-by: Aditya Maru <adityamaru@gmail.com>
@mwang1026

We pushed ballast changes but rescoped / reprioritized work on write backpressure. @jbowens split out these issues and close the one re: ballast.

@jbowens jbowens changed the title storage: graceful out-of-disk handling storage: automatic ballast files Dec 20, 2021
@jbowens
Collaborator Author

jbowens commented Dec 20, 2021

Closing for the ballast bits, and opened #74104.

@jbowens jbowens closed this as completed Dec 20, 2021
@joshimhoff
Collaborator

Thanks for all the work you did, @jbowens!
