In etcd 3.4, a corruption checker will be enabled by default on startup, and will be available to run periodically using the `--corrupt-check-time` flag. This corruption checker performs a full pass over all uncompacted revisions, computing hashes of every stored `KeyValue` record. Both the etcd member initiating the corruption check and all of its peers must compute the checksum so the results can be compared. This can be expensive.
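To make the cost concrete, here is a minimal, illustrative Go sketch (not the actual etcd code; the `kv` type stands in for `mvccpb.KeyValue`) of what a full pass amounts to: every uncompacted revision is read and folded into one checksum, so the work scales with the total size of the keyspace:

```go
package main

import (
	"fmt"
	"hash/crc32"
)

// kv stands in for etcd's mvccpb.KeyValue; only the fields that matter
// for hashing are shown.
type kv struct {
	key, value []byte
}

// fullPassHash mirrors the shape of the full corruption check: every
// uncompacted revision's record is read and folded into a single
// checksum, so cost grows with the total size of the keyspace.
func fullPassHash(revs []kv) uint32 {
	h := crc32.New(crc32.MakeTable(crc32.Castagnoli))
	for _, r := range revs {
		h.Write(r.key)
		h.Write(r.value)
	}
	return h.Sum32()
}

func main() {
	revs := []kv{{[]byte("foo"), []byte("bar")}, {[]byte("foo"), []byte("baz")}}
	fmt.Printf("full-pass hash: %08x\n", fullPassHash(revs))
}
```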
We should consider instead computing a CRC for each KeyValue when it is persisted, and writing that CRC to disk alongside the KeyValue record. We could then use these per-record CRCs to compute hashes of the entire db state, and, separately, validate that a KeyValue's data matches its CRC whenever it is read back from disk (e.g. during indexing or data serving).
This would result in both faster consistency checking and improved data corruption detection.
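As a rough sketch of the write-time CRC idea (the record layout and function names here are hypothetical, not etcd's actual on-disk format): compute the CRC once when the record is persisted, store it next to the record, and re-verify it on every read:

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/crc32"
)

var castagnoli = crc32.MakeTable(crc32.Castagnoli)

// record is a hypothetical on-disk layout: the serialized KeyValue
// bytes plus a CRC computed once, at persist time.
type record struct {
	data []byte // serialized KeyValue
	crc  uint32
}

// persist computes the CRC at write time, so later reads never have to
// trust the bytes blindly.
func persist(data []byte) record {
	return record{data: data, crc: crc32.Checksum(data, castagnoli)}
}

// verify re-checks the stored CRC whenever the record is read back
// (e.g. while rebuilding the index or serving a range request).
func verify(r record) error {
	if got := crc32.Checksum(r.data, castagnoli); got != r.crc {
		return fmt.Errorf("corrupted KeyValue: crc %08x, want %08x", got, r.crc)
	}
	return nil
}

// encode shows how the CRC could ride along with the record on disk:
// 4 bytes of checksum followed by the record body.
func encode(r record) []byte {
	buf := make([]byte, 4+len(r.data))
	binary.BigEndian.PutUint32(buf, r.crc)
	copy(buf[4:], r.data)
	return buf
}

func main() {
	r := persist([]byte("serialized-keyvalue-bytes"))
	fmt.Println(verify(r), len(encode(r)))
}
```

CRC-32C (Castagnoli) is hardware-accelerated on most modern CPUs, which is what would keep the read-path verification cheap.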
It would be a good first step toward fully incremental hash generation, which could work something like the following (a toy sketch follows the list):
- Put each revision's `KeyValue` entry into a merkle tree, where each node is identified by its revision (we'd need to account for sub-revisions and tombstones here).
- Each time a revision is written, update the merkle tree with a node for that revision.
- Each time a compaction occurs, remove the compacted nodes from the merkle tree (this might be a problem; suggestions welcome on how we might make this faster/cheaper).
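Here is a toy version of that structure to make the shape concrete (all names are made up for illustration). It recomputes interior nodes from scratch on every root query; a real incremental version would rehash only the leaf-to-root path on each write, and compaction is where this sketch gets expensive, as noted above:

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"sort"
)

// revTree is a toy merkle tree over revisions. Leaves are hashes of
// individual KeyValue records; the root summarizes the whole keyspace.
type revTree struct {
	leaves map[int64][32]byte // revision -> hash of that revision's KeyValue
}

func newRevTree() *revTree { return &revTree{leaves: map[int64][32]byte{}} }

// put is called each time a revision is written.
func (t *revTree) put(rev int64, kvBytes []byte) {
	t.leaves[rev] = sha256.Sum256(kvBytes)
}

// compact removes leaves at or below rev; as noted in the list above,
// this is the awkward/expensive part of the scheme.
func (t *revTree) compact(rev int64) {
	for r := range t.leaves {
		if r <= rev {
			delete(t.leaves, r)
		}
	}
}

// root folds the leaf hashes pairwise in revision order.
func (t *revTree) root() [32]byte {
	revs := make([]int64, 0, len(t.leaves))
	for r := range t.leaves {
		revs = append(revs, r)
	}
	sort.Slice(revs, func(i, j int) bool { return revs[i] < revs[j] })
	level := make([][32]byte, len(revs))
	for i, r := range revs {
		level[i] = t.leaves[r]
	}
	for len(level) > 1 {
		var next [][32]byte
		for i := 0; i < len(level); i += 2 {
			if i+1 == len(level) {
				next = append(next, level[i]) // odd node is carried up
				continue
			}
			next = append(next, sha256.Sum256(append(level[i][:], level[i+1][:]...)))
		}
		level = next
	}
	if len(level) == 0 {
		return [32]byte{}
	}
	return level[0]
}

func main() {
	t := newRevTree()
	t.put(1, []byte("foo=bar"))
	t.put(2, []byte("foo=baz"))
	fmt.Printf("root: %x\n", t.root())
	t.compact(1)
	fmt.Printf("root after compaction: %x\n", t.root())
}
```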
Regardless, even before introducing a merkle tree, the faster checksum generation could be used with the existing corruption checker, and would also work well for Chubby-style checksum checking (see section 6.2 of "Paxos Made Live - An Engineering Perspective"). I prefer that approach because all members compute checksums in response to a raft log entry, so their keyspaces are compacted to the same point, which avoids having to skip checks because of compaction index mismatches.

cc @jingyih @wenjiaswe @lavalamp @gyuho
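To sketch how the Chubby-style trigger could combine with the per-record CRCs (the `checksumRequest` type and `keyspaceChecksum` function below are hypothetical): the leader proposes a checksum request through raft, and each member, on applying that entry, folds its already-stored CRCs into a single value without re-reading any record bodies:

```go
package main

import (
	"fmt"
	"hash/crc32"
	"sort"
)

var castagnoli = crc32.MakeTable(crc32.Castagnoli)

// checksumRequest is a hypothetical entry a leader appends to the raft
// log, Chubby-style: every member computes its keyspace checksum when
// it applies this entry, so all members hash a state compacted to the
// same point.
type checksumRequest struct {
	raftIndex uint64
}

// keyspaceChecksum folds the already-stored per-KeyValue CRCs into one
// value, in revision order, so members never re-read the record bodies.
func keyspaceChecksum(crcByRev map[int64]uint32) uint32 {
	revs := make([]int64, 0, len(crcByRev))
	for r := range crcByRev {
		revs = append(revs, r)
	}
	sort.Slice(revs, func(i, j int) bool { return revs[i] < revs[j] })
	var buf []byte
	for _, r := range revs {
		c := crcByRev[r]
		buf = append(buf, byte(c>>24), byte(c>>16), byte(c>>8), byte(c))
	}
	return crc32.Checksum(buf, castagnoli)
}

func main() {
	req := checksumRequest{raftIndex: 42}
	crcs := map[int64]uint32{1: 0xdeadbeef, 2: 0xfeedface}
	fmt.Printf("checksum at raft index %d: %08x\n", req.raftIndex, keyspaceChecksum(crcs))
}
```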
@jpbetz The corruption checker has been delayed to 3.5 because we haven't figured out how to deal with TLS-secured peers. Let's milestone this for 3.5.