
storage: More improvements for rebalancing/compacting when disks are nearly full #22235

Merged
merged 3 commits into cockroachdb:master from fulldisk2 on Feb 5, 2018

Conversation

a-robinson
Contributor

Follow-up to #21866. Only the last three commits are new. This is the full version that I'll start testing on a real cluster with small disks.

Fixes #21400 (pending testing on a real cluster)

a-robinson requested review from petermattis and a team on January 30, 2018 at 23:40
@cockroach-teamcity
Member

This change is Reviewable

@petermattis
Collaborator

:lgtm: though someone else should put eyes on this as well.

The growing number of thresholds is something we should take a look at in 2.1. The interactions are becoming difficult to predict.


Review status: 0 of 9 files reviewed at latest revision, 1 unresolved discussion, some commit checks broke.


pkg/storage/compactor/compactor.go, line 334 at r9 (raw file):

		c.Metrics.CompactingNanos.Inc(int64(duration))
		if c.doneFn != nil {
			c.doneFn(ctx, "manual rocksdb compaction")

Not sure if "manual" is correct here, as the compaction was automatically performed by cockroach.


Comments from Reviewable

@a-robinson
Contributor Author

@bdarnell can you spare a second set of eyes?

The growing number of thresholds is something we should take a look at in 2.1. The interactions are becoming difficult to predict.

Indeed, it's getting ugly in places.


Review status: 0 of 9 files reviewed at latest revision, 1 unresolved discussion, some commit checks pending.


pkg/storage/compactor/compactor.go, line 334 at r9 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Not sure if "manual" is correct here, as the compaction was automatically performed by cockroach.

That's fair. Changed.


Comments from Reviewable

@bdarnell
Contributor

:lgtm:


Review status: 0 of 9 files reviewed at latest revision, all discussions resolved, some commit checks broke.


Comments from Reviewable

@a-robinson
Contributor Author

Alright, so the first round of real-cluster testing with this looked better than the control group running on master, but still wasn't good. My test was to set up two 6-node clusters (one running this change, one running master), put a 5GB ballast file on the 10GB disk of 3 of the nodes in each cluster, and then run kv --sequential --min-block-bytes=256 --max-block-bytes=257 on all of the nodes. My notes are below:

  • New code made it farther before crashing -- it stored so much more real data that one of the 10GB nodes managed to fill up in addition to the 3 that had a ballast file
  • Recovery from fullness is terrible -- I shut off all load (from here onwards, no SQL queries were being run) and tried truncating more and more off the ballast files, but kept failing:
    • First deleted only 32MiB from each ballast file, which understandably didn't work -- all nodes died shortly after restarting
    • Then deleted another 512MiB -- still all the full nodes ran out of disk shortly after restarting
    • Then deleted another 1GiB -- and still all the full nodes re-filled up their disks and died. That's ridiculous.
    • Then I deleted what was left of the ballast files and wiped the disk of the non-ballast node that filled up, and the clusters recovered
  • A ton of bytes got queued up for compaction at each restart, but from the logs it doesn't look like more than one compaction per node (of an estimated 200-300 MB) finished before they died
    • Maybe we actually do want exclusive_manual_compaction = true for very low disk scenarios to make them clear up space faster and stall other writes?
  • We started rebalancing to the previously full nodes during recovery, but that makes sense given that they weren't full anymore
    • During the previous attempts at recovery, the new code was up-replicating to the full nodes (until they hit 95% fullness) because all ranges legitimately had only 2 replicas (due to one of the 3 non-ballast nodes having filled up and died), which may have hurt the experiment. That's unfortunate, but I'm not sure whether we want to change this to avoid up-replicating to nearly full nodes (in addition to not rebalancing to them).
  • The new code actually failed again -- a bunch of the up-replication mentioned above sent replicas their way, which, combined with slow rocksdb compactions of all the old, GC'ed replicas, apparently used up too much disk on 2 of the 3 nodes
    • When the 2 nodes filled up and failed, there was no replication activity on the cluster and no user writes. The only things using space were thus background write traffic (timeseries, node liveness, etc.) and rocksdb compactions. Since we don't write timeseries data that quickly, I'm not sure how the failures could have been due to anything other than rocksdb compactions. Rocksdb needs to do better.

My next step is to try to understand what's going on in these compactions that's making them fill the disk (e.g. by examining the rocksdb manifest), whether it's expected behavior, and what we can do about it. I'm planning to put that off a bit and focus on #19985 today, though, so if anyone has thoughts in the meantime I'm all ears.
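
As a rough illustration of the ballast setup described above (the path, chunk size, and helper names here are illustrative, not the tooling actually used in this test), a ballast file can be written out with real zero bytes so the filesystem genuinely allocates the space, and later truncated in increments like the 32MiB/512MiB/1GiB steps attempted during recovery:

```go
package main

import (
	"log"
	"os"
)

// writeBallast fills path with zero bytes so the filesystem actually allocates
// the space; a sparse Truncate alone would not consume any disk.
func writeBallast(path string, size int64) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	buf := make([]byte, 1<<20) // write in 1 MiB chunks
	for written := int64(0); written < size; {
		n := size - written
		if n > int64(len(buf)) {
			n = int64(len(buf))
		}
		if _, err := f.Write(buf[:n]); err != nil {
			return err
		}
		written += n
	}
	return f.Sync()
}

// shrinkBallast gives delta bytes back to the filesystem, mirroring the
// incremental truncations tried while recovering the full nodes.
func shrinkBallast(path string, delta int64) error {
	info, err := os.Stat(path)
	if err != nil {
		return err
	}
	newSize := info.Size() - delta
	if newSize < 0 {
		newSize = 0
	}
	return os.Truncate(path, newSize)
}

func main() {
	const ballast = "/mnt/data1/ballast" // hypothetical path on a node's 10GB data disk
	if err := writeBallast(ballast, 5<<30); err != nil {
		log.Fatal(err)
	}
	// First recovery attempt from the notes above: free only 32MiB.
	if err := shrinkBallast(ballast, 32<<20); err != nil {
		log.Fatal(err)
	}
}
```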

@nvanbenschoten
Member

:lgtm:

Another thing to consider is that you're running on really small disks, so the breathing room provided by the fractional usage thresholds (maxFractionUsedThreshold, rebalanceToMaxFractionUsedThreshold) isn't much (<512MB). Perhaps we also want a flat threshold in addition to these fractional thresholds for small disk deployments.
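
To make that concrete, here is a minimal sketch of pairing a fractional threshold with a flat free-space floor; the struct, constant values, and function names are placeholders rather than the actual allocator code:

```go
package main

import "fmt"

// storeCapacity is a hypothetical stand-in for roachpb.StoreCapacity, carrying
// just the two fields this sketch needs.
type storeCapacity struct {
	Capacity  int64 // total bytes on the store's disk
	Available int64 // bytes still free
}

const (
	// maxFractionUsedThreshold mirrors the existing fractional threshold.
	maxFractionUsedThreshold = 0.95
	// minFreeBytesThreshold is the hypothetical flat floor (2 GiB here), which
	// gives a small disk far more breathing room than 5% of its capacity would.
	minFreeBytesThreshold = int64(2) << 30
)

// isTooFull would reject a store as a rebalance target when either threshold trips.
func isTooFull(c storeCapacity) bool {
	fractionUsed := 1 - float64(c.Available)/float64(c.Capacity)
	return fractionUsed > maxFractionUsedThreshold || c.Available < minFreeBytesThreshold
}

func main() {
	// A 10GB disk with 1GB free is only 90% used, so the fractional check alone
	// would still allow rebalances to it; the flat floor catches it.
	fmt.Println(isTooFull(storeCapacity{Capacity: 10 << 30, Available: 1 << 30}))
}
```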

A ton of bytes got queued up for compaction at each restart

Is this by our Compactor or by RocksDB? If it was the latter, then exclusive_manual_compaction makes sense to try out.


Reviewed 2 of 2 files at r7, 2 of 2 files at r8, 2 of 3 files at r9, 1 of 1 files at r10.
Review status: 4 of 9 files reviewed at latest revision, 1 unresolved discussion, some commit checks broke.


pkg/storage/compactor/compactor.go, line 100 at r10 (raw file):

type storeCapacityFunc func() (roachpb.StoreCapacity, error)

type doneCompactingFunc func(ctx context.Context, reason string)

reason leaking into this signature seems misplaced. Slight preference for

type doneCompactingFunc func(ctx context.Context)

...

s.compactor = compactor.NewCompactor(s.engine.(engine.WithSSTables), s.Capacity, func(ctx context.Context) {
    s.asyncGossipStore(ctx, "compactor-initiated rocksdb compaction")
})
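
Under that suggested signature, the call site quoted earlier at compactor.go line 334 would presumably collapse to a bare c.doneFn(ctx), roughly as in this self-contained sketch (the stub type and names are illustrative, not the real Compactor):

```go
package main

import (
	"context"
	"fmt"
)

// doneCompactingFunc without the reason parameter, per the suggestion above.
type doneCompactingFunc func(ctx context.Context)

// compactorStub is a hypothetical stand-in for the Compactor, showing how the
// callback is invoked once the reason string lives in the closure instead.
type compactorStub struct {
	doneFn doneCompactingFunc
}

func (c *compactorStub) finish(ctx context.Context) {
	if c.doneFn != nil {
		c.doneFn(ctx) // no reason threaded through the callback anymore
	}
}

func main() {
	c := &compactorStub{doneFn: func(ctx context.Context) {
		// In the real code this would be s.asyncGossipStore(ctx, "compactor-initiated rocksdb compaction").
		fmt.Println("gossiping store capacity after compaction")
	}}
	c.finish(context.Background())
}
```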

Comments from Reviewable

@a-robinson
Contributor Author

@petermattis is your take that this and #21866 should go in or that they should wait until after we understand/fix the problems I ran into when trying to recover from the full disks?

This is by our Compactor or by RocksDB? If it was the latter then exclusive_manual_compaction makes sense to try out.

By our compactor.

@petermattis
Collaborator

I think this can go in as-is given that it's a definite improvement over the current state. It would be nice to fully understand the problem with recovering from full disks, but I'm also a believer in incremental progress.


Review status: 4 of 9 files reviewed at latest revision, 1 unresolved discussion, some commit checks broke.


Comments from Reviewable

@a-robinson
Contributor Author

Another thing to consider is that you're running on really small disks, so the breathing room provided by the fractional usage thresholds (maxFractionUsedThreshold, rebalanceToMaxFractionUsedThreshold) isn't much (<512MB). Perhaps we also want a flat threshold in addition to these fractional thresholds for small disk deployments.

Yeah, it's very possible that's contributing to the problems here. I'll mention that in the follow-up issue.


Review status: 0 of 9 files reviewed at latest revision, 1 unresolved discussion.


pkg/storage/compactor/compactor.go, line 100 at r10 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

reason leaking into this signature seems misplaced. Slight preference for

type doneCompactingFunc func(ctx context.Context)

...

s.compactor = compactor.NewCompactor(s.engine.(engine.WithSSTables), s.Capacity, func(ctx context.Context) {
    s.asyncGossipStore(ctx, "compactor-initiated rocksdb compaction")
})

Done.


Comments from Reviewable

I'm not sure whether we should also do this for raft snapshots -- is
it better for a node to run out of disk or get stuck in a behind state
that causes other nodes to keep trying to send it snapshots?

Release note: None
Touches cockroachdb#21400

Release note: Free up disk space more aggressively when the disk is
closer to full.
This helps all nodes' allocators have more up-to-date capacity
information sooner after a significant change.

Touches cockroachdb#21400

Release note: None
a-robinson merged commit 3581b56 into cockroachdb:master on Feb 5, 2018
a-robinson deleted the fulldisk2 branch on May 18, 2018 at 20:32