
storage: out of disk space #8473

Closed
BramGruneir opened this issue Aug 11, 2016 · 41 comments
Labels
A-storage Relating to our storage engine (Pebble) on-disk storage. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) S-2-temp-unavailability Temp crashes or other availability problems. Can be worked around or resolved by restarting. S-3-ux-surprise Issue leaves users wondering whether CRDB is behaving properly. Likely to hurt reputation/adoption.

Comments

@BramGruneir
Member

A node should fail gracefully when out of disk space.

@BramGruneir BramGruneir added this to the Q3 milestone Aug 11, 2016
@BramGruneir BramGruneir self-assigned this Aug 11, 2016
BramGruneir added a commit to BramGruneir/cockroach that referenced this issue Aug 25, 2016
testServerArgs will now take StoreSpecs instead of just a number of stores.
This also adds StoreSpecsPerNode on testClusterArgs to enable per node store
settings. Currently, only in-memory stores are supported.

Part of work towards cockroachdb#8473.
BramGruneir added a commit to BramGruneir/cockroach that referenced this issue Aug 25, 2016
testServerArgs will now take StoreSpecs instead of just a number of stores.
This also adds ServerArgsPerNode on testClusterArgs to enable per server
customizable settings. Currently, only in-memory stores are supported.

Part of work towards cockroachdb#8473.
BramGruneir added a commit to BramGruneir/cockroach that referenced this issue Sep 28, 2016
testServerArgs will now take StoreSpecs instead of just a number of stores.
This also adds ServerArgsPerNode on testClusterArgs to enable per server
customizable settings. Currently, only in-memory stores are supported.

Part of work towards cockroachdb#8473.
@BramGruneir
Member Author

BramGruneir commented Sep 29, 2016

There are three ways in which we will mitigate out-of-disk-space errors.

  1. Generate a simple metric that bubbles up to the admin UI (and Prometheus and other reporting systems) when a store is low on disk space (<10% left) or critically low (<5% left); see the sketch below.
  2. Panic/stop a node when we hit a critical disk space limit (before RocksDB does). This should be at around <1% left.
  3. Consider moving back to rebalancing on percent capacity remaining instead of range counts. This would address heterogeneous clusters (clusters with different disk sizes).

I will create issues for each of these and close out this one.
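
As a rough illustration of the thresholds in item 1 (this sketch is not from the issue; the package, helper, and band names are made up), the classification might look something like this in Go:

package storagesketch

// diskSpaceState classifies a store's remaining capacity into the bands
// described in item 1 above.
type diskSpaceState int

const (
	diskSpaceNormal   diskSpaceState = iota
	diskSpaceLow                     // less than 10% free: warn in the admin UI / Prometheus
	diskSpaceCritical                // less than 5% free: critical alert
)

// classifyDiskSpace takes the fraction of a store's capacity that is still
// free (0.0 to 1.0) and returns the corresponding band.
func classifyDiskSpace(freeFraction float64) diskSpaceState {
	switch {
	case freeFraction < 0.05:
		return diskSpaceCritical
	case freeFraction < 0.10:
		return diskSpaceLow
	default:
		return diskSpaceNormal
	}
}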

@petermattis
Collaborator

I'm surprised performing range-level flow-control isn't on that list. Rebalancing alone cannot prevent running into an out of disk space situation.

PS We already have alerting on our test clusters for low disk space. I'm not sure there is anything additional to do for 1.

@BramGruneir
Member Author

For the metrics, I'm going to look into the current ones and make sure we also surface them in the admin UI, and improve them if needed.

Range-level flow-control is required, but I was looking for tractable changes that we can do right now. I'll add an issue for that as well.

Also, after chatting with @bdarnell about this, I think he's correct in saying we should not be doing any type of replica/range freezes or other fancy solutions. Best to fail naturally to avoid any corruption.

@petermattis
Collaborator

Probably worthwhile to consider out of disk space scenarios:

  • Runaway process (outside of CockroachDB's control) starts sucking up disk space. For example, logging excessively. This can happen rapidly, maybe even faster than we can rebalance ranges away.
  • Poor rebalancing heuristics move too many ranges to a node. This should be considered a bug which we should fix.
  • Natural range growth. A node with 10,000 32MB ranges (312.5 GB) can see each of them double in size (625 GB), causing it to run out of disk space. Rebalancing can help here as long as there is capacity in the cluster. If the cluster is out of space, panicking seems problematic. How do I read data from a full cluster?

@petermattis
Collaborator

Best to fail naturally to avoid any corruption.

Is there any evidence that we corrupt data when we run out of disk space? I thought the problem was that a node crashed when that happened. RocksDB certainly shouldn't corrupt data if it can't write.

@BramGruneir
Member Author

Runaway process (outside of CockroachDB's control) starts sucking up disk space. For example, logging excessively. This can happen rapidly, maybe even faster than we can rebalance ranges away.

There's only so much we can do here. By panicking, we give the admin time to clean up the disk or move the data to a larger drive before starting the node back up.

Poor rebalancing heuristics move too many ranges to a node. This should be considered a bug which we should fix.

Agreed. We already have safeguards in place for this. It would be nice to have some of the other options, including high/low watermark levels for rebalancing, but as of right now, I don't think they would really help. Moving to rebalancing based on size instead of range count would go a long way toward fixing our issues.

Natural range growth. A node with 10,000 32MB ranges (312.5 GB) can see each of them double in size (625 GB) causing it to run out of disk space. Rebalancing can help here as long as there is capacity in the cluster. If the cluster is out of space, panicing seems problematic. How do I read data from a full cluster?

I see reading data from a cluster that is over capacity as a non-issue at this point. We might be able to add a mechanism to do so from the command line or some other solution, but for now, my advice would be to copy the data to machines with more disk space and restart the cluster. This could easily be a version 2 or 3 feature, but it shouldn't be a goal for 1.0.

Is there any evidence that we corrupt data when we run out of disk space? I thought the problem was that a node crashed when that happened. RocksDB certainly shouldn't corrupt data if it can't write.

I went looking but I can't find clear evidence. I did find some projects that depend on RocksDB reporting that, after disk-full errors, their checksums were not correct (after recovery). RocksDB tends to just suggest freeing up space (facebook/rocksdb#919). But why even let it get that far? If we drain the node when it's close to full, we do our best not to cross that line. Obviously, there's no guarantee that we can stop RocksDB from panicking, as we only have so much control over disk space.

@BramGruneir
Member Author

BramGruneir commented Sep 29, 2016

Some other thoughts on this issue.

It would be really nice to prevent KV writes at the range level when a disk is nearly full and just return a disk-full error (at the SQL level this could be the Postgres error code 51300 - DISK FULL). This can be achieved by having the leader check the free space on all stores that hold a replica and, if any are too full, reject the write. This becomes extremely tricky if one of the affected ranges contains meta or other important system ranges, so perhaps we skip this write blocking for those system ranges.
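
A minimal sketch of that leader-side check (hypothetical names and threshold; not the actual CockroachDB code path), assuming the leader can see an up-to-date free-space fraction for every store holding a replica:

package storagesketch

import "errors"

// errDiskFull corresponds to the Postgres error code 51300 (DISK FULL)
// mentioned above; mapping it to a pgcode would happen at the SQL layer.
var errDiskFull = errors.New("disk full (SQLSTATE 51300)")

// maybeRejectWrite rejects a KV write if any store holding a replica of the
// range is below the assumed free-space threshold. System ranges are exempt
// so that meta and other critical ranges keep accepting writes.
func maybeRejectWrite(isSystemRange bool, replicaStoreFreeFractions []float64) error {
	const minFreeFraction = 0.05 // assumed threshold for illustration
	if isSystemRange {
		return nil
	}
	for _, free := range replicaStoreFreeFractions {
		if free < minFreeFraction {
			return errDiskFull
		}
	}
	return nil
}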


I'd like to expand on the concept of rebalancing based on %free space left instead of range count. For clarity, %free is calculated as the minimum of either what's left of the defined space quota on the store or the total free space on the node.
This has a number of benefits.

  • Splits would no longer be a reason to start rebalancing.
  • We already have the concept of thresholds, so if we look to stay within 5% of the mean, we should consider ourselves balanced.

We can add some other types of thresholds to this as well. For example, as long as all stores are less than 50% used, just rebalance based on total bytes, and only once a store passes 50% used switch to %free. This would keep everything even until there is some pressure on the system to become offset.

But there are some complications around zones that will need to be addressed. We need to calculate the mean %free per zone and not just for the overall system. I think this is best left for future work.
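
For concreteness, a small sketch of the %free calculation described above (minimum of the remaining store quota and the node's free disk space) together with the within-5%-of-the-mean check; all names here are made up for illustration:

package storagesketch

// storeSpace carries the inputs to the %free calculation.
type storeSpace struct {
	QuotaBytes    int64 // configured space quota for the store
	UsedBytes     int64 // bytes currently used by the store
	NodeFreeBytes int64 // free bytes left on the node's disk
}

// freeFraction returns %free: the minimum of what's left of the store's quota
// and the total free space on the node, expressed as a fraction of the quota.
func (s storeSpace) freeFraction() float64 {
	free := s.QuotaBytes - s.UsedBytes
	if s.NodeFreeBytes < free {
		free = s.NodeFreeBytes
	}
	if free < 0 {
		free = 0
	}
	return float64(free) / float64(s.QuotaBytes)
}

// withinMean reports whether a store's %free is within 5% of the cluster mean,
// i.e. whether we would consider it balanced.
func withinMean(storeFree, meanFree float64) bool {
	diff := storeFree - meanFree
	if diff < 0 {
		diff = -diff
	}
	return diff <= 0.05
}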


Also, as an unanswered question, do we still record metrics when we're full?
Perhaps we can start recording them at slower rates?
Or do we just consider them like any other data?
Do we prune old metrics right now?
Should we prune more when under stress due to lack of space?

@cuongdo
Contributor

cuongdo commented Sep 29, 2016

I agree with moving back to percentage disk space used in principle. Previously, I believe we were looking at overall disk space used (everything on disk, including OS) instead of disk space used just by the store. So, if we use the store's live bytes, that should solve the problem we were originally seeing: that with small clusters (live bytes <<< size of OS and other files in bytes), no rebalancing was occurring at all.

As always with rebalancing changes, please test with small clusters (these are where first impressions are formed) and larger clusters. We want the rebalancer to work smoothly as the cluster goes from no data to some large amount of data.

I have a vague worry about using percentages but nothing concrete yet. I'll continue to think about that.


Regarding metrics, I don't think we should do anything special for them for now, especially since pruning requires more Raft commands (meaning more disk space consumed).

@petermattis
Collaborator

There's only so much we can do here. By panicking we allow the admin time to clean up the disk or moving the data to a larger drive before starting the node back up.

Panicking is a big hammer. Having the cluster grind to a halt when it runs out of disk space really isn't acceptable.

Moving to rebalancing based on size instead of range count would go a long way to fixing our issues.
...
I'd like to expand on the concept of rebalancing based on %free space left instead range count.

Are the current rebalancing heuristics actually a problem? I haven't seen any evidence of that yet. We already rebalance away from any node that is >95% full.

I went looking but I can't find clear evidence. I did find some projects that depend on RocksDB reporting that, after disk-full errors, their checksums were not correct (after recovery). RocksDB tends to just suggest freeing up space (facebook/rocksdb#919). But why even let it get that far? If we drain the node when it's close to full, we do our best not to cross that line. Obviously, there's no guarantee that we can stop RocksDB from panicking, as we only have so much control over disk space.

We might not be able to drain the node (the entire cluster is full). I don't see a point in panicking at a certain percent fullness vs panicking when we're completely out of space. I do see a large benefit in refusing writes at a certain fullness (i.e. entering a read+delete only mode). If we panic at 99% full, how do you rebalance ranges off the node? Consider this scenario: you start up a 3-node cluster and write data to it. The nodes fill up, hit your panic threshold, and crash. You add 2 additional nodes to the cluster, but you can't actually restart it because the 3 existing nodes are already at capacity and panic every time they are started. No bueno.

As always with rebalancing changes, please test with small clusters (these are where first impressions are formed) and larger clusters. We want the rebalancer to work smoothly as the cluster goes from no data to some large amount of data.

Agreed, but my strong suspicion is that trying to address out-of-disk via rebalancing heuristics is the wrong tool. We undoubtedly need additional improvements to the rebalancing heuristics and will be improving them for the foreseeable future, but out-of-disk space requires a new mechanism to handle real-world scenarios of interest.

@tbg
Member

tbg commented Sep 29, 2016

Panicking is a big hammer. Having the cluster grind to a halt when it runs out of disk space really isn't acceptable.

It's not optimal, but I just don't think there is much to win here by over-engineering at this point. There are very obvious avenues to iterate on as this becomes a real-world concern. Let's take trivial precautions now, run some acceptance test clusters with asymmetric disks, and see how efficiently the rebalancer manages to get data off the full node. Turning into a read-only cluster when the disk is full seems like such a fringe feature. To compare, this is how much Postgres cares: https://www.postgresql.org/docs/9.6/static/disk-full.html.

@cuongdo
Contributor

cuongdo commented Sep 29, 2016

In the short term, I also support a simple panic when we have critically low disk space. There are just so many things that can prevent a more graceful degradation of service with low disk, especially when so much of our system depends on being able to write to disk, even for reads.

@petermattis
Collaborator

To be clear, I'm not arguing we should be doing something complex at this time. I think panicking when we're critically low on disk space is a bit strange. IMO, better to just run up to the out-of-disk limit. If we want to give the user the flexibility to recover from the out-of-disk situation, then we need to reserve some disk space they can release. A threshold we panic at would need to be configurable so that we can disable it in order to allow a near-out-of-disk node to start up (otherwise the node is permanently wedged).

Here is a docs-only solution (we could bake this into the cockroach binary itself if we wanted):

CockroachDB will crash when it runs out of disk space. As an administrator, it may be prudent to reserve a portion of disk space ahead of time so that there is space that can be freed up in an emergency. For example, the following command will reserve 5 GB of disk space in a file named reserved:

dd if=/dev/zero of=reserved bs=$[1024 * 1024] count=$[5 * 1024]

@bdarnell
Contributor

Having the cluster grind to a halt when it runs out of disk space really isn't acceptable.

But I think it's inevitable, since even read-only operation requires renewing leases, etc.

I think that panicking when nearly full is a better outcome than trying to run right up to the limit (in which case rocksdb will panic for us).

A threshold we panic at would need to be configurable so that we can disable it in order to allow an near-out-of-disk node to start up (otherwise the node is permanently wedged).

It's wedged until you can free up space, which is exactly what would happen if we let the space be completely exhausted. The difference is that having a little bit of slack might help you free up that space (for example, we might have a CLI command to do a RocksDB compaction, which might be able to free up space). Or you have room to compress some files instead of just deleting them, etc.

+1 to recommending a ballast file that can be deleted in emergencies.

@petermattis
Collaborator

I think that panicking when nearly full is a better outcome than trying to run right up to the limit (in which case rocksdb will panic for us).

I don't see much of a difference. Can you expand on why you think an early panic by us is better than waiting for rocksdb to panic?

(for example, we might have a CLI command to do a rocksdb compaction which might be able to free up space)

This could be tricky given that a RocksDB compaction needs disk space in order to free up disk space.

Turning into a read-only cluster when the disk is full seems like such a fringe feature. To compare, this is how much Postgres cares.

Interesting, though I'd argue that this is exactly an area we should care about more than Postgres. Consider my earlier example where you start a 3-node cluster and then all the nodes become full. One of the promises of CockroachDB is easy/seamless scalability. It shouldn't take herculean efforts (e.g. copying the node data to bigger disks) to unwedge such a cluster. Rather, our story should be: add a few more nodes and the cluster will automatically be ready to use again.

Note, I'm not arguing for something complex here, just a reasonable answer to this scenario. Using a ballast (I like that term) file might be sufficient: if your cluster is out of disk space, add additional capacity via additional nodes, delete the ballast files on existing nodes, and restart the cluster. If cockroach managed the ballast files itself, this could be done rather seamlessly.
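
A minimal sketch of what cockroach-managed ballast handling could look like (purely illustrative and not taken from the cockroach codebase). Zeros are written explicitly rather than just truncating the file, since a sparse file would not actually reserve space:

package storagesketch

import "os"

// createBallast reserves sizeBytes on disk by writing zeros to path.
func createBallast(path string, sizeBytes int64) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	buf := make([]byte, 1<<20) // write in 1 MiB chunks
	for written := int64(0); written < sizeBytes; {
		chunk := int64(len(buf))
		if remaining := sizeBytes - written; remaining < chunk {
			chunk = remaining
		}
		n, err := f.Write(buf[:chunk])
		if err != nil {
			return err
		}
		written += int64(n)
	}
	return f.Sync()
}

// releaseBallast deletes the ballast file, freeing the reserved space in an
// emergency (e.g. so the node can start, compact, or rebalance data away).
func releaseBallast(path string) error {
	return os.Remove(path)
}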

@bdarnell
Contributor

I think that panicking when nearly full is a better outcome than trying to run right up to the limit (in which case rocksdb will panic for us).

I don't see much of a difference. Can you expand on why you think an early panic by us is better than waiting for rocksdb to panic?

(for example, we might have a CLI command to do a rocksdb compaction which might be able to free up space)

This could be tricky given that a RocksDB compaction needs disk space in order to free up disk space.

This is exactly why an early panic could be desirable. Panicking isn't ideal (and if rocksdb didn't already panic when out of disk space I probably wouldn't recommend it), but it's easy.

@petermattis
Collaborator

This could be tricky given that a RocksDB compaction needs disk space in order to free up disk space.
This is exactly why an early panic could be desirable. Panicking isn't ideal (and if rocksdb didn't already panic when out of disk space I probably wouldn't recommend it), but it's easy.

If we believe that RocksDB compaction is a viable way to free up disk space, we need to verify how much free space that requires. I would guess it potentially needs several times the max sstable size (currently 128 MB), but it might need more.

@BramGruneir
Member Author

BramGruneir commented Sep 30, 2016

The ballast file is an interesting idea, but could we not achieve the same thing without hoarding disk space? We could quit (not panic) when we're at less than 1% free space, and make this configurable via a command-line flag on start. Being able to turn off this autoquit would save the extra space (assuming no other sources are using up disk space) and allow for compaction, or allow the admin to add new storage and relieve the disk space pressure.
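
For illustration, a tiny sketch of what such a configurable autoquit could look like (the flag name and wiring are invented, not an actual cockroach flag):

package storagesketch

import "flag"

// autoQuitFreeFraction is a hypothetical flag: the node exits cleanly once the
// free-space fraction of a store drops below this value; 0 disables the check.
var autoQuitFreeFraction = flag.Float64(
	"auto-quit-free-fraction", 0.01,
	"exit cleanly when free space drops below this fraction; 0 disables",
)

// shouldAutoQuit is what a periodic store-capacity check might consult.
func shouldAutoQuit(freeFraction float64) bool {
	return *autoQuitFreeFraction > 0 && freeFraction < *autoQuitFreeFraction
}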

I'm going to investigate if compaction is a way to free up disk space from an already running cluster.

@BramGruneir
Member Author

BramGruneir commented Sep 30, 2016

Currently, we perform compactions at a rate of around 6 per 5 minutes (on gamma) and 3 per 10 minutes (on rho). I very much doubt we're going to gain a lot of space by forcing a compaction when we're near the disk space limit. If we only compacted every hour it might be the case, but not when we are compacting every minute or so.

@petermattis
Collaborator

The compaction metrics are measuring individual compactions involving a few sstables, not full compactions as would be performed from the command line. The individual compactions free up a relatively small amount of space. I don't think your current analysis is valid. What would be better is to take a snapshot from one of the nodes on gamma (i.e. copy the data directory), record the size of the directory (e.g. using du -sh), run ./cockroach debug compact and then measure the size of the directory again.

The ballast file is an interesting idea, but could we not achieve the same thing without the need to just hoard some disk space?

The ballast guarantees you have disk space you can free. Without reserving disk space in that way some process (inside Cockroach) may create data that we can't delete. With the ballast we are guaranteed that we'll panic before using all of the disk space and then have a known amount of disk space that can be freed.

@spencerkimball
Member

It's hard to support ideas like compaction or down-replication. They are non-trivial complications and worse, they either won't work at all or won't work for long.

I think we'll put ourselves on a solid footing and can close this bug for now (and let @BramGruneir work on something else) if we:

  • Change our allocator to have the following logic: if a node is at or above the low free space threshold, it should no longer be considered as a rebalancing target; further, it should be considered as a source for rebalancing, in the same way as a node which is greater than the threshold % higher than the mean range count. Otherwise, if the node is below the low free space threshold, the normal allocation decisions apply (compare to the threshold above or below the mean range count). (See the sketch below.)
  • Panic above a critical space threshold.

This will keep us from pushing any node to its critical threshold if there is other space in the cluster. If the entire cluster is full, then we'll panic. But that seems like a good first step.
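
The sketch referenced in the first bullet above, with simplified stand-ins for the store descriptor and thresholds (these names do not come from the allocator code):

package storagesketch

// candidate is a simplified stand-in for a store descriptor.
type candidate struct {
	FreeFraction float64 // fraction of the store's capacity still free
	RangeCount   int
}

const lowFreeSpaceThreshold = 0.10 // assumed "low free space" threshold

// isRebalanceTarget: a store at or below the low-free-space threshold is never
// chosen as a rebalance target.
func isRebalanceTarget(c candidate) bool {
	return c.FreeFraction > lowFreeSpaceThreshold
}

// isRebalanceSource: a store low on free space is treated as a rebalance
// source, just like a store whose range count is sufficiently above the mean.
func isRebalanceSource(c candidate, meanRangeCount, countThreshold int) bool {
	if c.FreeFraction <= lowFreeSpaceThreshold {
		return true
	}
	return c.RangeCount > meanRangeCount+countThreshold
}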

@petermattis
Collaborator

Change our allocator to have the following logic: If a node is at or above the low free space threshold, it should no longer be considered as a rebalancing target; further, it should be considered as a source for rebalancings

The logic you propose is already present in the allocator. See rangeCountBalancer.shouldRebalance and the maxCapacityUsed condition. Also see rangeCountBalancer.selectRandom which is used to select a target for rebalancing. We never select a store which is more than 95% full as a target for rebalancing and we always consider a store that is more than 95% full as a source for rebalancing.

I'm leaning towards doing either nothing in the near term to address this issue, or adding support for the ballast file.

@garvitjuniwal
Contributor

@dianasaur323 The main concern we have is around how to recover a cockroach node from the scenario where disks are full. What we observed in one of our tests is that a node hangs indefinitely upon an ENOSPC error. That might have been fixed by #19287, but when testing with the fix, we ran into more serious correctness issues (possibly related to #16004 (comment)). We haven't gotten a chance to try our test with the fix for that.

In another case, we ended up accidentally filling a cockroach cluster to 100% on all nodes, and there was no way for us to recover from that situation, for example by truncating a table. We had to wipe the cluster and start over.

My expectations around space issues are that:

  1. Cockroach shuts down gracefully on ENOSPC. It might be a better user experience if cockroach stays up in a minimal mode where writes fail but you can do basic things like node status, but I don't see this as a strict requirement. No requests should hang.

  2. There should be a way to quickly recover after a cluster is full. Procedurally, this can mean that a ballast file (say, 5% of disk capacity) is created on the cockroach disk to begin with, and cockroach shuts down when disks become close to 100% full. To recover from this situation, the user should first shut down the applications that are trying to add data to the cluster, then bring cockroach back up after replacing the ballast file with a smaller one. Then:

  • If there is data in the cluster that the user can remove, they should be able to quickly remove it and get it garbage collected. It should be easy for the user to tell which tables are taking up the most space on disk, how much of that data is historical, and how much is live, current data.
  • If no data in the cluster can be deleted, the user should add one or more new cockroach nodes to the cluster. Cockroach should then rebalance the existing data.

The ballast file should then be resized to the original size, and all operations can resume.

@dianasaur323
Contributor

@garvitjuniwal thanks for the detailed write-up! This is very helpful. Let me circle up with some people internally to see if we can address some of these issues in the next release and propose some solutions. I know there is already ongoing work in some other issues as well, but I don't think they directly address this one.

@dianasaur323
Contributor

Ok, I'm back with some updates after speaking with @tschottdorf. With 1, I believe we crash now, so even though it's not necessarily pretty, it shouldn't hang anymore. With 2, this seems to be a good suggestion that could be combined with 4. We could probably think about automating this, but for now, we might have to go with a manual workaround before we have time to provide a better UX here. With 3, we have an open issue about that here: #19329

@BramGruneir
Member Author

@dianasaur323, assigning this to you as I haven't been involved with it in a long time.

@knz knz added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) S-2-temp-unavailability Temp crashes or other availability problems. Can be worked around or resolved by restarting. labels Apr 27, 2018
@knz knz assigned kannanlakshmi and unassigned dianasaur323 May 9, 2018
@tbg tbg added the A-storage Relating to our storage engine (Pebble) on-disk storage. label May 15, 2018
@tbg
Member

tbg commented May 17, 2018

When a node is 95% full, does it transfer its leases away (and do other nodes stop transferring leases to it)? If so, it seems that we're handling this reasonably well. If a single node runs out of disk, it should try to get rid of both leases and data until it drops below the 95% threshold. The only potentially missing piece is alerting the operator to this fact.

Seems like we'd find out if we roachtested this.

@tbg
Member

tbg commented May 17, 2018

I think #25051 gets close to testing this already, except it fills all nodes' disks and writes data that gets GCed fast enough. We could add a flavor of it that has some nodes with enough space, and writes non-gc'able data and makes sure that the full nodes don't run out of disk space.

@a-robinson
Contributor

When a node is 95% full, does it transfer its leases away (and do other nodes stop transferring leases to it)?

That's false as written. Disk fullness doesn't affect lease transfer decisions. It's true if you s/leases/replicas/.

@tbg
Member

tbg commented May 17, 2018

Can't be false, since it's a question! Thanks for answering. The way I was hoping this would work is that the leaseholder would decide to rebalance itself away, and to achieve that it would transfer the lease. Are leaseholders just stuck on their respective nodes?

@a-robinson
Contributor

Can't be false, since it's a question!

:)

The way I was hoping this would work is that the leaseholder would decide to rebalance itself away, and to achieve that it would transfer the lease. Are leaseholders just stuck on their respective nodes?

Sorry, I didn't realize that's what you were asking. Leaseholders do transfer their lease away if they want to remove themselves from the range:

if removeReplica.StoreID == repl.store.StoreID() {
	// The local replica was selected as the removal target, but that replica
	// is the leaseholder, so transfer the lease instead. We don't check that
	// the current store has too many leases in this case under the
	// assumption that replica balance is a greater concern. Also note that
	// AllocatorRemove action takes preference over AllocatorConsiderRebalance
	// (rebalancing) which is where lease transfer would otherwise occur. We
	// need to be able to transfer leases in AllocatorRemove in order to get
	// out of situations where this store is overfull and yet holds all the
	// leases. The fullness checks need to be ignored for cases where
	// a replica needs to be removed for constraint violations.
	transferred, err := rq.transferLease(
		ctx,
		repl,
		desc,
		zone,
		transferLeaseOptions{
			dryRun: dryRun,
		},
	)
	if err != nil {
		return false, err
	}
	// Do not requeue as we transferred our lease away.
	if transferred {
		return false, nil
	}
}

@tbg
Member

tbg commented May 17, 2018

Ok. Then, assuming we don't write too fast, that roachtest I'm suggesting above should work, and it's all a matter of making it actually work, correct?

@a-robinson
Contributor

I would like for it to work. I wouldn't bet much on it given the questionable behavior I saw of rocksdb's compactions filling up the disk on otherwise unused nodes in #22387.

@tbg
Member

tbg commented May 17, 2018

Good point. Really need to figure out how to make DeleteFilesInRange work ;)

@petermattis petermattis removed this from the Later milestone Oct 5, 2018
@tbg
Member

tbg commented Oct 12, 2018

folding into #7782
