
kvserver: consistency check should only checkpoint relevant range #90543

Closed
erikgrinaker opened this issue Oct 24, 2022 · 10 comments · Fixed by #95963
Labels
  • C-enhancement: Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
  • E-quick-win: Likely to be a quick win for someone experienced.
  • N-followup: Needs followup.
  • O-postmortem: Originated from a Postmortem action item.

Comments

erikgrinaker (Contributor) commented Oct 24, 2022

When the consistency checker detects a range inconsistency, it takes a storage checkpoint on all nodes with replicas of the range. The checkpoint is made of hardlinks, so it is initially a cheap copy of the entire database. However, as data is written over time, the checkpoint stops sharing space with the main database and eventually consumes as much space as it, which can run the node out of disk rather rapidly.

We should instead consider checkpointing only the SSTs that are relevant to the replica, to avoid running the node out of disk. The checkpoint should still be exposed as a Pebble database containing those SSTs and the relevant manifest history, so that the usual debug tooling can be used to investigate it.

Note that we specifically don't want to export only the KV pairs of the replica, since we often need the LSM structure for debugging, e.g. for Pebble compaction bugs.
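For context, this is roughly what a checkpoint amounts to at the Pebble level today: a full hardlink-based copy of the store. A minimal standalone sketch (the helper name and paths are illustrative only; CockroachDB goes through its storage engine wrapper rather than opening Pebble directly):

```go
package main

import (
	"log"

	"github.com/cockroachdb/pebble"
)

// takeFullCheckpoint takes a full, hardlink-based checkpoint of a store.
// Creating it is cheap, but it pins every SST in the store: as compactions
// rewrite the live data, the checkpoint stops sharing space with the main
// database and its disk usage grows toward a full copy.
func takeFullCheckpoint(storeDir, checkpointDir string) error {
	db, err := pebble.Open(storeDir, &pebble.Options{})
	if err != nil {
		return err
	}
	defer db.Close()
	return db.Checkpoint(checkpointDir)
}

func main() {
	// Paths are placeholders; the real consistency checker checkpoints the
	// already-open store engine rather than opening the store separately.
	if err := takeFullCheckpoint("cockroach-data", "auxiliary/checkpoints/r123"); err != nil {
		log.Fatal(err)
	}
}
```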

Jira issue: CRDB-20829

erikgrinaker added the C-enhancement and T-kv-replication labels Oct 24, 2022
blathers-crl bot commented Oct 24, 2022

cc @cockroachdb/replication

erikgrinaker (Contributor, Author) commented:

@jbowens The majority of the work here will be in Pebble; should I write up a separate issue there?

erikgrinaker (Contributor, Author) commented:

@sumeerbhola pointed out that we'll often also want to keep the neighbouring ranges for debugging, even if they aren't immediately adjacent in the keyspace.

blathers-crl bot added the T-storage (Storage Team) label Oct 24, 2022
nicktrav added the O-postmortem label Oct 26, 2022
lunevalex added the N-followup label Nov 21, 2022
RaduBerinde added the E-quick-win label Jan 13, 2023
RaduBerinde (Member) commented:

The pebble side of this is done (see cockroachdb/pebble#2237).
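For reference, a minimal sketch of how a span-restricted checkpoint might be taken with that Pebble change. The option and type names here (WithRestrictToSpans, CheckpointSpan) are my reading of cockroachdb/pebble#2237 and may differ slightly; the key bounds and paths are placeholders:

```go
package main

import (
	"log"

	"github.com/cockroachdb/pebble"
)

// takeRestrictedCheckpoint takes a checkpoint containing only the SSTs that
// overlap the given key spans. The result is still a normal Pebble database
// with the relevant manifest history, so the usual debug tooling can open it,
// but it no longer pins every SST in the store.
func takeRestrictedCheckpoint(db *pebble.DB, dir string, spans []pebble.CheckpointSpan) error {
	return db.Checkpoint(dir, pebble.WithRestrictToSpans(spans))
}

func main() {
	db, err := pebble.Open("cockroach-data", &pebble.Options{})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	// Placeholder bounds; in CockroachDB they would be derived from the
	// inconsistent range's descriptor.
	spans := []pebble.CheckpointSpan{{Start: []byte("a"), End: []byte("z")}}
	if err := takeRestrictedCheckpoint(db, "checkpoint-r123", spans); err != nil {
		log.Fatal(err)
	}
}
```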

erikgrinaker (Contributor, Author) commented:

Fantastic, thanks! Will storage be doing the CRDB integration as well, or would you like us (repl) to take it from here?

RaduBerinde (Member) commented:

I think your team might know better how to choose the spans to restrict the checkpoint to (sounds like we want the relevant range plus some ranges around it, plus some internal key space?)

erikgrinaker (Contributor, Author) commented:

Yep, we'll pick it up. Thanks again Radu!

erikgrinaker removed the T-storage (Storage Team) label Jan 16, 2023
pav-kv (Collaborator) commented Jan 25, 2023

@erikgrinaker @sumeerbhola In the checkpoint failure investigation a few months ago, which ranges in the checkpoint (besides the inconsistent one) ended up being useful for finding the bug?

Was it only the checkpoint from the killed node that was useful? For the killed nodes we don't need to worry much about narrowing the checkpoint and can take a full one. For the nodes that survive (and are likely correct), reducing the cost of the checkpoint matters more, so we could take a partial checkpoint of only the range itself. The neighbouring ranges might be completely different across the 3 nodes, so I'm not sure whether including them in the checkpoint is useful. Also, only the raftMu of the range under consideration is held while the checkpoint is taken, so neighbouring ranges (even if they match) will be out of sync across these checkpoints.

So I'm inclined to propose this (sketched below):

  • Take a full checkpoint on the nodes that will be killed.
  • Take a single-range checkpoint on the surviving nodes.
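A rough sketch of that branching, reusing the span-restriction option mentioned above (whose name may differ); the function and the willBeKilled flag are hypothetical, not existing CockroachDB code:

```go
package kvsketch

import "github.com/cockroachdb/pebble"

// takeConsistencyCheckpoint sketches the proposed policy: a node that is
// about to be terminated for the inconsistency keeps a full checkpoint,
// while a surviving (likely correct) node restricts its checkpoint to the
// inconsistent range only.
func takeConsistencyCheckpoint(
	db *pebble.DB, dir string, willBeKilled bool, rangeStart, rangeEnd []byte,
) error {
	if willBeKilled {
		// Full hardlink-based checkpoint: costly over time, but the node is
		// going down anyway and we want maximum debuggability.
		return db.Checkpoint(dir)
	}
	// Surviving node: only pin the SSTs that overlap the inconsistent range.
	spans := []pebble.CheckpointSpan{{Start: rangeStart, End: rangeEnd}}
	return db.Checkpoint(dir, pebble.WithRestrictToSpans(spans))
}
```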

pav-kv (Collaborator) commented Jan 25, 2023

That said, the neighbouring ranges can be useful for debugging split- and merge-related issues. @erikgrinaker @sumeerbhola @tbg Can you think of other reasons or examples why we would extend the checkpoint beyond the single range?

erikgrinaker (Contributor, Author) commented:

> In the checkpoint failure investigation a few months ago, which ranges in the checkpoint (besides the inconsistent one) ended up being useful for finding the bug?

The neighbouring ranges. There was a Pebble range tombstone in a neighbouring range that wasn't properly truncated to the range bounds and therefore extended into the inconsistent range, deleting a bunch of keys that it shouldn't have. We want to preserve the neighbouring ranges so that we can look at interactions across ranges, e.g. with range keys and range tombstones.

> Also, only the raftMu of the range under consideration is held while the checkpoint is taken, so neighbouring ranges (even if they match) will be out of sync across these checkpoints.

The neighbouring ranges are mostly useful for looking at cross-range LSM interactions on a single node, so it's ok if they aren't completely in sync across nodes, as long as they're consistent on each node (which Pebble ensures).

> For the killed nodes we don't need to worry much about narrowing the checkpoint and can take a full one. For the nodes that survive (and are likely correct), reducing the cost of the checkpoint matters more, so we could take a partial checkpoint of only the range itself.

That's true, but I figure we may as well keep it simple and do the same checkpointing across all nodes?
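To make that concrete, a minimal sketch that computes one set of bounds, covering the inconsistent range plus its immediate neighbours, and applies it identically on every node. Everything here except db.Checkpoint and the span-restriction option from pebble#2237 (whose name may differ) is hypothetical plumbing:

```go
package kvsketch

import "github.com/cockroachdb/pebble"

// checkpointSpan widens the inconsistent range's bounds to also cover its
// immediate left and right neighbours, so that cross-range LSM artifacts
// such as range tombstones and range keys that straddle a range boundary
// end up in the checkpoint. A nil neighbour bound means there is no
// neighbour on that side.
func checkpointSpan(leftStart, rangeStart, rangeEnd, rightEnd []byte) pebble.CheckpointSpan {
	start, end := rangeStart, rangeEnd
	if leftStart != nil {
		start = leftStart
	}
	if rightEnd != nil {
		end = rightEnd
	}
	return pebble.CheckpointSpan{Start: start, End: end}
}

// takeCheckpoint applies the same span-restricted checkpoint on every node,
// killed or surviving, keeping the policy uniform and simple.
func takeCheckpoint(db *pebble.DB, dir string, span pebble.CheckpointSpan) error {
	return db.Checkpoint(dir, pebble.WithRestrictToSpans([]pebble.CheckpointSpan{span}))
}
```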

craig bot closed this as completed in b031933 Feb 6, 2023