
storage: More improvements for rebalancing/compacting when disks are nearly full #22235

Merged
merged 3 commits into cockroachdb:master from fulldisk2 on Feb 5, 2018

Conversation

a-robinson
Contributor

Follow-up to #21866. Only the last three commits are new. This is the full version that I'll start testing on a real cluster with small disks.

Fixes #21400 (pending testing on a real cluster)

a-robinson requested review from petermattis and a team on January 30, 2018 at 23:40
@cockroach-teamcity
Member

This change is Reviewable

@petermattis
Collaborator

:lgtm: though someone else should put eyes on this as well.

The growing number of thresholds is something we should take a look at in 2.1. The interactions are becoming difficult to predict.


Review status: 0 of 9 files reviewed at latest revision, 1 unresolved discussion, some commit checks broke.


pkg/storage/compactor/compactor.go, line 334 at r9 (raw file):

		c.Metrics.CompactingNanos.Inc(int64(duration))
		if c.doneFn != nil {
			c.doneFn(ctx, "manual rocksdb compaction")

Not sure if "manual" is correct here, as the compaction was automatically performed by cockroach.


Comments from Reviewable

@a-robinson
Contributor Author

@bdarnell can you spare a second set of eyes?

The growing number of thresholds is something we should take a look at in 2.1. The interactions are becoming difficult to predict.

Indeed, it's getting ugly in places.


Review status: 0 of 9 files reviewed at latest revision, 1 unresolved discussion, some commit checks pending.


pkg/storage/compactor/compactor.go, line 334 at r9 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Not sure if "manual" is correct here, as the compaction was automatically performed by cockroach.

That's fair. Changed.


Comments from Reviewable

@bdarnell
Contributor

:lgtm:


Review status: 0 of 9 files reviewed at latest revision, all discussions resolved, some commit checks broke.


Comments from Reviewable

@a-robinson
Contributor Author

Alright, so the first round of real-cluster testing with this looked better than the control group running on master, but still wasn't good. My test was to set up two 6-node clusters (one running this change, one running master), put a 5GB ballast file on the 10GB disk of 3 of the nodes in each cluster, and then run kv --sequential --min-block-bytes=256 --max-block-bytes=257 on all of the nodes. My notes are below:

  • New code made it farther before crashing -- it stored so much more real data that one of the 10GB nodes managed to fill up in addition to the 3 that had a ballast file
  • Recovery from fullness is terrible -- I shut off all load (from here onwards, no SQL queries were being run) and tried truncating more and more off the ballast files, but kept failing:
    • First deleted only 32MiB from each ballast file, which understandably didn't work -- all nodes died shortly after restarting
    • Then deleted another 512MiB -- still all the full nodes ran out of disk shortly after restarting
    • Then deleted another 1GiB -- and still all the full nodes re-filled up their disks and died. That's ridiculous.
    • Then I deleted what was left of the ballast files and wiped the disk of the non-ballast node that filled up, and the clusters recovered
  • A ton of bytes got queued up for compaction at each restart, but from the logs it doesn't look like more than one compaction per node (of an estimated 200-300 MB) finished before they died
    • Maybe we actually do want exclusive_manual_compaction = true for very low disk scenarios to make them clear up space faster and stall other writes?
  • We started rebalancing to the previously full nodes during recovery, but that makes sense given that they weren't full anymore
    • During the previous attempts at recovery, the new code was up-replicating to the full nodes (until they hit 95% fullness) because all ranges legitimately had only 2 replicas (due to one of the 3 non-ballast nodes having filled up and died), which may have hurt the experiment. That's unfortunate, but I'm not sure whether we want to change this to avoid up-replicating to nearly full nodes (in addition to not rebalancing to them).
  • The new code actually failed again -- a bunch of the up-replication mentioned above sent replicas their way, which, combined with slow rocksdb compactions of all the old, GC'ed replicas, apparently used up too much disk on 2 of the 3 nodes
    • When the 2 nodes filled up and failed, there was no replication activity on the cluster and no user writes. The only things using space were thus background write traffic (timeseries, node liveness, etc.) and rocksdb compactions. Since we don't write timeseries data that quickly, I'm not sure how the failures could have been due to anything other than rocksdb compactions. Rocksdb needs to do better.

My next step is to try to understand what's going on in these compactions that's making them fill the disk (e.g. by examining the rocksdb manifest), whether it's expected behavior, and what we can do about it. I'm planning to put that off a bit and focus on #19985 today, though, so if anyone has thoughts in the meantime I'm all ears.
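
As a rough illustration of the ballast setup described above (the path, chunk size, and helper names here are illustrative, not the tooling actually used in this test), a ballast file can be written out with real zero bytes so the filesystem genuinely allocates the space, and later truncated in increments like the 32MiB/512MiB/1GiB steps attempted during recovery:

```go
package main

import (
	"log"
	"os"
)

// writeBallast fills path with zero bytes so the filesystem actually allocates
// the space; a sparse Truncate alone would not consume any disk.
func writeBallast(path string, size int64) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	buf := make([]byte, 1<<20) // write in 1 MiB chunks
	for written := int64(0); written < size; {
		n := size - written
		if n > int64(len(buf)) {
			n = int64(len(buf))
		}
		if _, err := f.Write(buf[:n]); err != nil {
			return err
		}
		written += n
	}
	return f.Sync()
}

// shrinkBallast gives delta bytes back to the filesystem, mirroring the
// incremental truncations tried while recovering the full nodes.
func shrinkBallast(path string, delta int64) error {
	info, err := os.Stat(path)
	if err != nil {
		return err
	}
	newSize := info.Size() - delta
	if newSize < 0 {
		newSize = 0
	}
	return os.Truncate(path, newSize)
}

func main() {
	const ballast = "/mnt/data1/ballast" // hypothetical path on a node's 10GB data disk
	if err := writeBallast(ballast, 5<<30); err != nil {
		log.Fatal(err)
	}
	// First recovery attempt from the notes above: free only 32MiB.
	if err := shrinkBallast(ballast, 32<<20); err != nil {
		log.Fatal(err)
	}
}
```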

@nvanbenschoten
Member

:lgtm:

Another thing to consider is that you're running on really small disks, so the breathing room provided by the fractional usage thresholds (maxFractionUsedThreshold, rebalanceToMaxFractionUsedThreshold) isn't much (<512MB). Perhaps we also want a flat threshold in addition to these fractional thresholds for small disk deployments.
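
To make that concrete, here is a minimal sketch of pairing a fractional threshold with a flat free-space floor; the struct, constant values, and function names are placeholders rather than the actual allocator code:

```go
package main

import "fmt"

// storeCapacity is a hypothetical stand-in for roachpb.StoreCapacity, carrying
// just the two fields this sketch needs.
type storeCapacity struct {
	Capacity  int64 // total bytes on the store's disk
	Available int64 // bytes still free
}

const (
	// maxFractionUsedThreshold mirrors the existing fractional threshold.
	maxFractionUsedThreshold = 0.95
	// minFreeBytesThreshold is the hypothetical flat floor (2 GiB here), which
	// gives a small disk far more breathing room than 5% of its capacity would.
	minFreeBytesThreshold = int64(2) << 30
)

// isTooFull would reject a store as a rebalance target when either threshold trips.
func isTooFull(c storeCapacity) bool {
	fractionUsed := 1 - float64(c.Available)/float64(c.Capacity)
	return fractionUsed > maxFractionUsedThreshold || c.Available < minFreeBytesThreshold
}

func main() {
	// A 10GB disk with 1GB free is only 90% used, so the fractional check alone
	// would still allow rebalances to it; the flat floor catches it.
	fmt.Println(isTooFull(storeCapacity{Capacity: 10 << 30, Available: 1 << 30}))
}
```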

A ton of bytes got queued up for compaction at each restart

Is this by our Compactor or by RocksDB? If it was the latter, then exclusive_manual_compaction makes sense to try out.


Reviewed 2 of 2 files at r7, 2 of 2 files at r8, 2 of 3 files at r9, 1 of 1 files at r10.
Review status: 4 of 9 files reviewed at latest revision, 1 unresolved discussion, some commit checks broke.


pkg/storage/compactor/compactor.go, line 100 at r10 (raw file):

type storeCapacityFunc func() (roachpb.StoreCapacity, error)

type doneCompactingFunc func(ctx context.Context, reason string)

reason leaking into this signature seems misplaced. Slight preference for

type doneCompactingFunc func(ctx context.Context)

...

s.compactor = compactor.NewCompactor(s.engine.(engine.WithSSTables), s.Capacity, func(ctx context.Context) {
    s.asyncGossipStore(ctx, "compactor-initiated rocksdb compaction")
})
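
Under that suggested signature, the call site quoted earlier at compactor.go line 334 would presumably collapse to a bare c.doneFn(ctx), roughly as in this self-contained sketch (the stub type and names are illustrative, not the real Compactor):

```go
package main

import (
	"context"
	"fmt"
)

// doneCompactingFunc without the reason parameter, per the suggestion above.
type doneCompactingFunc func(ctx context.Context)

// compactorStub is a hypothetical stand-in for the Compactor, showing how the
// callback is invoked once the reason string lives in the closure instead.
type compactorStub struct {
	doneFn doneCompactingFunc
}

func (c *compactorStub) finish(ctx context.Context) {
	if c.doneFn != nil {
		c.doneFn(ctx) // no reason threaded through the callback anymore
	}
}

func main() {
	c := &compactorStub{doneFn: func(ctx context.Context) {
		// In the real code this would be s.asyncGossipStore(ctx, "compactor-initiated rocksdb compaction").
		fmt.Println("gossiping store capacity after compaction")
	}}
	c.finish(context.Background())
}
```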

Comments from Reviewable

@a-robinson
Contributor Author

@petermattis is your take that this and #21866 should go in or that they should wait until after we understand/fix the problems I ran into when trying to recover from the full disks?

This is by our Compactor or by RocksDB? If it was the latter then exclusive_manual_compaction makes sense to try out.

By our compactor.

@petermattis
Collaborator

I think this can go in as-is given that it's a definite improvement over the current state. It would be nice to fully understand the problem with recovering from full disks, but I'm also a believer in incremental progress.


Review status: 4 of 9 files reviewed at latest revision, 1 unresolved discussion, some commit checks broke.


Comments from Reviewable

@a-robinson
Contributor Author

Another thing to consider is that you're running on really small disks, so the breathing room provided by the fractional usage thresholds (maxFractionUsedThreshold, rebalanceToMaxFractionUsedThreshold) isn't much (<512MB). Perhaps we also want a flat threshold in addition to these fractional thresholds for small disk deployments.

Yeah, it's very possible that's contributing to the problems here. I'll mention that in the follow-up issue.


Review status: 0 of 9 files reviewed at latest revision, 1 unresolved discussion.


pkg/storage/compactor/compactor.go, line 100 at r10 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

reason leaking into this signature seems misplaced. Slight preference for

type doneCompactingFunc func(ctx context.Context)

...

s.compactor = compactor.NewCompactor(s.engine.(engine.WithSSTables), s.Capacity, func(ctx context.Context) {
    s.asyncGossipStore(ctx, "compactor-initiated rocksdb compaction")
})

Done.


Comments from Reviewable

I'm not sure whether we should also do this for raft snapshots -- is
it better for a node to run out of disk or get stuck in a behind state
that causes other nodes to keep trying to send it snapshots?

Release note: None
Touches cockroachdb#21400

Release note: Free up disk space more aggressively when the disk is
closer to full.
This helps all nodes' allocators have more up-to-date capacity
information sooner after a significant change.

Touches cockroachdb#21400

Release note: None
a-robinson merged commit 3581b56 into cockroachdb:master on Feb 5, 2018
a-robinson deleted the fulldisk2 branch on May 18, 2018 at 20:32