kvserver: consistency check should only checkpoint relevant range #90543
cc @cockroachdb/replication
@jbowens The majority of the work here will be in Pebble; should I write up a separate issue there?
@sumeerbhola pointed out that we'll often also want to keep the neighbouring ranges for debugging, even if they aren't immediately adjacent in the keyspace.
The Pebble side of this is done (see cockroachdb/pebble#2237).
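For illustration, a minimal sketch of using that option, assuming the API added in cockroachdb/pebble#2237 is the `WithRestrictToSpans` checkpoint option (the helper name and key bounds here are hypothetical):

```go
package main

import "github.com/cockroachdb/pebble"

// checkpointRange takes a checkpoint restricted to a single key span,
// so SSTs that don't overlap the span can be left out of the checkpoint.
func checkpointRange(db *pebble.DB, destDir string, start, end []byte) error {
	return db.Checkpoint(destDir, pebble.WithRestrictToSpans([]pebble.CheckpointSpan{
		{Start: start, End: end},
	}))
}
```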
Fantastic, thanks! Will storage be doing the CRDB integration as well, or would you like us (repl) to take it from here?
I think your team might know better how to choose the spans to restrict the checkpoint to (sounds like we want the relevant range plus some ranges around it, plus some internal key space?).
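As a hypothetical sketch of that span choice (the function and parameter names are illustrative, not the CRDB API, and key-encoding details are elided), the checkpoint could cover the inconsistent range together with its neighbours, plus the range's internal key space:

```go
package main

import "github.com/cockroachdb/pebble"

// spansToCheckpoint builds the checkpoint spans: one span covering the
// inconsistent range and its immediate neighbours, and one covering the
// range's internal (e.g. range-ID-local) keys such as Raft state.
func spansToCheckpoint(leftStart, rightEnd, internalStart, internalEnd []byte) []pebble.CheckpointSpan {
	return []pebble.CheckpointSpan{
		{Start: leftStart, End: rightEnd},
		{Start: internalStart, End: internalEnd},
	}
}
```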
Yep, we'll pick it up. Thanks again, Radu!
@erikgrinaker @sumeerbhola In the checkpoint failure investigation a few months ago, which ranges in the checkpoint (other than the inconsistent one) ended up useful for finding the bug? Was it only the checkpoint from the killed node that was useful?

For the killed nodes we don't need to bother much about narrowing the checkpoint; we can take a full one. For the nodes that survive (and are likely correct), it is more important to reduce the cost of checkpoints, so we can take a partial checkpoint of only the range itself. The neighbouring ranges might be completely different across the 3 nodes, so I'm not sure whether including them in the checkpoint is useful. Also, only the range under consideration is known to be inconsistent.

So I'm inclined to propose this: take a full checkpoint on the killed node(s), and a partial checkpoint of only the inconsistent range on the surviving nodes.
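A hypothetical sketch of this proposal (the function and parameter names are illustrative; it assumes the `WithRestrictToSpans` option from cockroachdb/pebble#2237):

```go
package main

import "github.com/cockroachdb/pebble"

// takeConsistencyCheckpoint takes a full checkpoint on a node that is
// about to be terminated for the inconsistency, and a checkpoint
// restricted to the inconsistent range on nodes that keep running.
func takeConsistencyCheckpoint(db *pebble.DB, destDir string, willTerminate bool, span pebble.CheckpointSpan) error {
	if willTerminate {
		return db.Checkpoint(destDir)
	}
	return db.Checkpoint(destDir, pebble.WithRestrictToSpans([]pebble.CheckpointSpan{span}))
}
```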
That said, the neighbouring ranges can be useful for debugging split/merge-related issues. @erikgrinaker @sumeerbhola @tbg Can you think of other reasons or examples why we would extend the checkpoint beyond the single range?
The neighbouring ranges. There was a Pebble range tombstone in a neighbouring range which wasn't properly truncated to the range bounds and therefore extended into the inconsistent range, deleting a bunch of keys that it shouldn't have deleted. We want to preserve the neighbouring ranges so that we can look at interactions with e.g. range keys and range tombstones across ranges.
The neighbouring ranges are mostly useful for looking at cross-range LSM interactions on a single node, so it's OK if they aren't completely in sync across nodes, as long as they're consistent on each node (which Pebble ensures).
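For illustration, a self-contained sketch (not the actual bug, and not CRDB code) of why an untruncated range tombstone is dangerous: a tombstone meant for one range that overshoots its bounds also deletes keys in the neighbouring range:

```go
package main

import (
	"fmt"

	"github.com/cockroachdb/pebble"
	"github.com/cockroachdb/pebble/vfs"
)

func main() {
	// In-memory Pebble instance for the demonstration.
	db, err := pebble.Open("", &pebble.Options{FS: vfs.NewMem()})
	if err != nil {
		panic(err)
	}
	defer db.Close()

	// Two adjacent "ranges": [a, b) and [b, c).
	_ = db.Set([]byte("a1"), []byte("left range"), pebble.Sync)
	_ = db.Set([]byte("b1"), []byte("right range"), pebble.Sync)

	// A range tombstone intended for [a, b) that wasn't truncated to
	// the range bounds and overshoots into [b, c).
	_ = db.DeleteRange([]byte("a"), []byte("c"), pebble.Sync)

	// b1 is gone even though it belongs to the neighbouring range.
	_, closer, err := db.Get([]byte("b1"))
	fmt.Println(err == pebble.ErrNotFound) // true
	if closer != nil {
		closer.Close()
	}
}
```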
That's true, but I figure we may as well just keep it simple and do the same checkpointing across all nodes?
When the consistency checker detects a range inconsistency, it takes a storage checkpoint on all nodes with replicas of the range. These checkpoints are made of hardlinks, so they are initially a cheap copy of the entire database. However, as the main database rewrites its files over time, the checkpoint keeps the old files alive, so it eventually consumes as much space as the database itself. This can run the node out of disk rather rapidly.
We should instead consider checkpointing only the SSTs that are relevant to the replica, to avoid running the node out of disk. The result should still be a valid Pebble database containing those SSTs, so that the usual debug tooling can be used to investigate it. It also needs to contain the relevant manifest history.
Note that we specifically don't want to export the KV pairs of the replica, since we often need the LSM structure for debugging, e.g. due to Pebble compaction bugs.
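A minimal sketch of that debugging flow, assuming a checkpoint directory produced as above (the path is hypothetical): since the checkpoint is itself a valid Pebble database, it can be opened read-only for investigation.

```go
package main

import "github.com/cockroachdb/pebble"

func main() {
	// Open the checkpoint as a read-only Pebble database; the usual
	// tooling can then inspect the LSM without modifying it.
	db, err := pebble.Open("/path/to/checkpoint", &pebble.Options{ReadOnly: true})
	if err != nil {
		panic(err)
	}
	defer db.Close()
}
```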
Jira issue: CRDB-20829