
Proposals should include a merkle root #13839

Open
lavalamp opened this issue Mar 25, 2022 · 21 comments
Labels
priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. stage/tracked

Comments

@lavalamp

lavalamp commented Mar 25, 2022

In response to #13766 and past issues (e.g. #11613) that get into the same condition:

It is possible for etcd to get into a condition where the databases on different replicas have different contents, and then proceed to commit changes. There have been other routes to this in the past. I don't think it's possible to prevent all db corruption issues, but it is certainly possible to make etcd stop and recover when the DB doesn't match.

The reason etcd can proceed is that the protocol has replicas agree on the contents of changes but not on the state of the db after applying the change. The easiest fix for this is to adjust things to compute a hash covering the entire db. This can be done efficiently (log N hashing operations per db change) by using a merkle tree. If proposed changes also included a proposed new merkle root hash, a replica with a differing db would not be able to accept a change, and this condition would be caught the instant it happened.

(And it's recoverable at that point by reading the correct state from the other replicas. Moreover, depending on how you implement it, the merkle tree could be examined a layer at a time to find what is corrupted instead of needing to copy the entire state, which could be large.)

The merkle tree technique is very common in cryptocurrencies. etcd is basically a cryptocurrency with no token and a non-byzantine-resistant coordination mechanism. These are both correct tradeoffs given what etcd does, but continuing to accept changes to a corrupted state is extremely bad -- any cryptocurrency with a corresponding bug would totally go to zero.
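To make the idea concrete, here is a minimal Go sketch of a proposal carrying its expected post-apply root. Everything here is hypothetical (`Proposal`, `apply`, and `rootHash` are not etcd types), and the full-scan hash stands in for the real merkle tree:

```go
package sketch

import (
	"bytes"
	"crypto/sha256"
	"fmt"
	"sort"
)

// Proposal is a hypothetical raft entry that carries, alongside the change
// itself, the merkle root the proposer expects the whole db to have after
// the change is applied.
type Proposal struct {
	Key, Value   []byte
	ExpectedRoot [32]byte
}

// apply applies the change, then rejects it if this replica's post-apply
// root differs from the proposer's, so a diverged replica is caught the
// instant it happens instead of silently committing on top of bad state.
// (A real implementation would do the check inside the apply transaction
// and roll back on mismatch.)
func apply(db map[string][]byte, p Proposal) error {
	db[string(p.Key)] = p.Value
	root := rootHash(db)
	if !bytes.Equal(root[:], p.ExpectedRoot[:]) {
		return fmt.Errorf("state divergence: got root %x, proposer expected %x",
			root, p.ExpectedRoot)
	}
	return nil
}

// rootHash is a stand-in that scans the full db in sorted-key order; the
// point of a merkle tree is to compute the same kind of root incrementally,
// with O(log N) hashing operations per change.
func rootHash(db map[string][]byte) [32]byte {
	keys := make([]string, 0, len(db))
	for k := range db {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	h := sha256.New()
	for _, k := range keys {
		h.Write([]byte(k))
		h.Write(db[k])
	}
	var out [32]byte
	copy(out[:], h.Sum(nil))
	return out
}
```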

@lavalamp
Author

cc @jpbetz, whom I've tried to sell on this in the past :)

@lavalamp
Author

Looks like Joe previously proposed this in #10893, but that went stale :(

@serathius
Member

Yep, I pointed out the problem of this issue going stale in #13775 and wanted to propose implementing #10893 as part of the graduation of the corruption check in #9190.

Let's avoid tracking the same issue in multiple places. @lavalamp do you want to propose implementing merkle trees as graduation criteria in #9190 and close this one?

@lavalamp
Author

That sounds great to me! How do I do that?

@ptabor
Contributor

ptabor commented Mar 25, 2022

+1.

I see 3 options:
a) The merkle tree is integrated into bbolt, i.e. we can reliably ask bbolt for its checksum at each transactional state. This would guarantee bbolt-level consistency and is probably the strongest form of physical consistency we can get... but it requires changing the data-storage format and might have the biggest performance impact.

b) We somehow arbitrarily partition the etcd key-space into a merkle tree. It's hard to represent the tree structure of keys, as etcd is agnostic to it. For debugging it would be nice to preserve key-space continuity (i.e. neighbouring sorted keys frequently sharing the same merkle-tree node), but that seems difficult. If we hash each key individually and build the merkle tree on the hashes of the keys (e.g. bit-groups defining the node structure), this would work, but it would not help isolate the problem for debugging.

c) We don't need a merkle tree. We maintain a hash of the snapshot plus a chain of hashes over all the proposals that get applied at the MVCC layer. This would guarantee determinism of the actions performed on MVCC, but an inconsistency could still originate from inside the MVCC implementation.
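For concreteness, a minimal Go sketch of option (c), with hypothetical names: the snapshot hash seeds a chain, and each proposal applied at the MVCC layer extends it.

```go
package sketch

import "crypto/sha256"

// ApplyChain maintains option (c)'s running hash: it starts from the hash
// of the snapshot and folds in each proposal as it is applied. Replicas
// applying the same proposals in the same order must agree on the chain
// head; what the chain cannot see is corruption arising inside the MVCC
// implementation after a proposal has been folded in.
type ApplyChain struct {
	head [32]byte
}

func NewApplyChain(snapshotHash [32]byte) *ApplyChain {
	return &ApplyChain{head: snapshotHash}
}

// Fold extends the chain: head' = SHA-256(head || proposal bytes).
func (c *ApplyChain) Fold(proposal []byte) [32]byte {
	h := sha256.New()
	h.Write(c.head[:])
	h.Write(proposal)
	copy(c.head[:], h.Sum(nil))
	return c.head
}
```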

@serathius
Member

That sounds great to me! How do I do that?

Just leave a comment on #9190

@lavalamp
Author

re: a): I do not think it is a good idea to do this in the storage layer, since it contains a bunch of state that is irrelevant to the state of the db at a given revision (e.g., it depends on what has been compacted). We should not make the etcd replicas' hash computation depend on anything historical; it should be stateless, purely a function of the db state at a given revision. That permits, e.g., efficiently un-corrupting a replica: it arrives at the correct state, but by an unconventional path.

re: b): I expect the easiest thing to do is a two-step process:

  1. compute a key || value hash for every key
  2. put those in a separate patricia-merkle tree (e.g. from memory, bucket by prefix, hash each bucket, make a merkle tree from the bucket hashes; this gives the right properties)

Given an existing such structure and a hypothetical operation, it's easyish to figure out an expected root hash.
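A rough Go sketch of that two-step structure, flattened to two levels for brevity (leaf hashes bucketed by the first key byte, then a root over the bucket hashes); names are illustrative, and a real patricia-merkle tree would recurse on longer prefixes:

```go
package sketch

import (
	"crypto/sha256"
	"sort"
)

// leafHash is step 1: one hash covering key || value, per key.
func leafHash(key, value []byte) [32]byte {
	h := sha256.New()
	h.Write(key)
	h.Write(value)
	var out [32]byte
	copy(out[:], h.Sum(nil))
	return out
}

// bucketedRoot is step 2: bucket the leaf hashes by the first byte of the
// key, hash each bucket, then hash the bucket hashes together. Comparing
// bucket hashes between replicas narrows corruption down to one bucket
// without copying the entire state.
func bucketedRoot(kvs map[string][]byte) [32]byte {
	keys := make([]string, 0, len(kvs))
	for k := range kvs {
		keys = append(keys, k)
	}
	sort.Strings(keys) // deterministic leaf order within each bucket

	var buckets [256][][32]byte
	for _, k := range keys {
		var prefix byte
		if len(k) > 0 {
			prefix = k[0]
		}
		buckets[prefix] = append(buckets[prefix], leafHash([]byte(k), kvs[k]))
	}

	root := sha256.New()
	for _, leaves := range buckets { // deterministic bucket order
		if len(leaves) == 0 {
			continue
		}
		bh := sha256.New()
		for _, l := range leaves {
			bh.Write(l[:])
		}
		root.Write(bh.Sum(nil))
	}
	var out [32]byte
	copy(out[:], root.Sum(nil))
	return out
}
```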

re: c): AFAICT if you do it that way, there's no way to efficiently un-corrupt the database. Additionally, even if you have a correct snapshot and a correct replay log, you can have a bug where a transaction gets applied wrong, and that doesn't get detected until the next snapshot is checked. This also requires all etcd replicas to take snapshots at the exact same time; I think that shouldn't be part of the state for correctness checking.

@lavalamp
Author

OK I commented there, if that's sufficient we can close this.

@serathius
Member

serathius commented Mar 25, 2022

Based on the comments above, there are non-trivial decisions to be made; let's keep the issue open and continue the discussion on the design.

@xiang90
Contributor

xiang90 commented Mar 27, 2022

@ptabor @serathius

Thought about this a long time ago. Never got time to implement though :P Hope it still helps.

Step 1

  • Enable the hash check at startup time. A full state scan is always the least error-prone and safest option compared to incremental approaches (a merkle tree or snapshot+lineage, for example). Delaying startup seems OK from a performance/latency perspective.

Step 2

  • Store and piggyback the consistent index for runtime checking. We can view the consistent index as a special incremental hash of the MVCC state, with false negatives. If different members have different consistent indexes for the same applied index, then we know there must be an issue (see the sketch at the end of this comment). I assume this approach will detect 95% of issues. It will also identify when the inconsistency happened.

Step 3

  • Implement a merkle tree or other tree-based incremental hash check at the backend layer, with low/no false negatives. The backend is a layer above the bbolt database and a layer below MVCC. That is where things (state changes) converge into a smaller set of APIs and we still have full control inside etcd (we see bbolt as an external dependency). The implementation should be a side-channel in-memory structure.

Some follow-up steps are needed to enable the incremental hash in a safe way. Initially, it should be just a checking mechanism, and should not stop the cluster from functioning. As we build more confidence in this hash checking, we can start to rely on it to pull the hand brake (stop the cluster), etc.
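As an illustration of the step 2 comparison (with hypothetical types, not etcd's actual API):

```go
package sketch

import "fmt"

// memberReport is a hypothetical per-member report: the raft index the
// member has applied and the consistent index its store records for it.
type memberReport struct {
	Name            string
	AppliedIndex    uint64
	ConsistentIndex uint64
}

// checkConsistentIndexes flags divergence when two members disagree on the
// consistent index for the same applied index. Agreement proves nothing
// (false negatives: same indexes, different data), but disagreement is a
// sure signal, and the applied index pinpoints when it happened.
func checkConsistentIndexes(reports []memberReport) error {
	seen := make(map[uint64]memberReport)
	for _, r := range reports {
		if prev, ok := seen[r.AppliedIndex]; ok && prev.ConsistentIndex != r.ConsistentIndex {
			return fmt.Errorf("divergence at applied index %d: %s has consistent index %d, %s has %d",
				r.AppliedIndex, prev.Name, prev.ConsistentIndex, r.Name, r.ConsistentIndex)
		}
		seen[r.AppliedIndex] = r
	}
	return nil
}
```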

@xiang90
Contributor

xiang90 commented Mar 27, 2022

A related topic - we should keep the practice of running failure injection tests before any minor releases. (maybe we still do this today?) We used to run at least 3 clusters for about 1 month. It almost always catches inconsistency bugs :P.

@jpbetz
Contributor

jpbetz commented Mar 28, 2022

A related topic - we should keep the practice of running failure injection tests before any minor releases. (maybe we still do this today?) We used to run at least 3 clusters for about 1 month. It almost always catches inconsistency bugs :P.

+1. I vaguely remember that after one of the inconsistency bugs we found a few years back, we improved the injection testing to check for failures around some of the restart functionality, as a way of trying to prevent future occurrences of that class of issue.

@serathius
Member

+1 for more failure injection testing. Unfortunately, depending on manual testing leads to inconsistency in release qualification; we need to invest more in automation. I'm planning to write a public postmortem that will go into the actions we should take to prevent such issues in the future.

@shalinmangar

How about automated testing based on Jepsen? Is that something we run regularly today, or before a release? I have some experience with it, so I can help set it up.

@lavalamp
Author

lavalamp commented Apr 1, 2022

Automated testing is good, but the problem is not just etcd bugs that fall outside our hypothesis space (which testing is already iffy at finding) -- it's actually not even sufficient for the etcd code to be 100% correct. I'm thinking in particular of RAM-, disk-, or network-based corruption. The real task is to have the replicas arrive at the same state in the face of this kind of error.

@MagicStarTrace

etcd: v3.5.2

I added it to the startup flags; how can I verify whether it takes effect (experimental-initial-corrupt-check)?


Does it mean that 3.5.3 does not need to add the parameter "--experimental-initial-corrupt-check"?

Thank you!

@stale

stale bot commented Jul 13, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Jul 13, 2022
@serathius
Member

This is proposed as a postmortem action item and planned for v3.7.

@lavalamp
Author

I've thought some more about this, and I think significant gains can be had just from storing a hash of each {key, value} pair, so that corruption of individual values can be detected before building on them. This should also be simpler, and it is a prerequisite for a global hash of some sort anyway.
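A minimal Go sketch of that per-pair checking, assuming a hypothetical store layout (not etcd's actual schema):

```go
package sketch

import (
	"crypto/sha256"
	"errors"
)

// hashedValue pairs a value with a checksum over key || value, computed at
// write time. (A real implementation would length-prefix the key to avoid
// concatenation ambiguity.)
type hashedValue struct {
	Value []byte
	Sum   [32]byte
}

func put(store map[string]hashedValue, key string, value []byte) {
	sum := sha256.Sum256(append([]byte(key), value...))
	store[key] = hashedValue{Value: value, Sum: sum}
}

// get re-checks the stored hash, so a corrupted value is detected the
// moment it is read, before any transaction builds on it; the caller can
// then fetch the correct value from another member.
func get(store map[string]hashedValue, key string) ([]byte, error) {
	hv, ok := store[key]
	if !ok {
		return nil, errors.New("not found")
	}
	if sha256.Sum256(append([]byte(key), hv.Value...)) != hv.Sum {
		return nil, errors.New("checksum mismatch: value corrupted")
	}
	return hv.Value, nil
}
```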

@serathius
Member

Instead of checking single KV pairs, I would look into adding hashes to bbolt pages. bbolt implements a B+ tree, which should be extensible to include hash values like a merkle tree, allowing a consistency check of the whole database.
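Roughly what that could look like, sketched on a toy B+-tree-style node rather than bbolt's actual page format (the structure below is an assumption for illustration):

```go
package sketch

import "crypto/sha256"

// node is a toy B+-tree-ish node: each branch covers its children's hashes,
// so the root hash covers the whole tree, merkle-style, and corruption in
// any page changes the root.
type node struct {
	leaf     bool
	keys     [][]byte
	values   [][]byte // set on leaves
	children []*node  // set on branches
}

func (n *node) hash() [32]byte {
	h := sha256.New()
	if n.leaf {
		for i, k := range n.keys {
			h.Write(k)
			h.Write(n.values[i])
		}
	} else {
		for _, c := range n.children {
			ch := c.hash()
			h.Write(ch[:])
		}
	}
	var out [32]byte
	copy(out[:], h.Sum(nil))
	return out
}
```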

@lavalamp
Author

lavalamp commented Sep 1, 2022

The benefits of hashing every KV pair:

  • you can check the hash when the KV is an input to a transaction
  • on finding a corrupt key, you can easily get the correct value from another member--recovery is very easy

Hashing bbolt pages is better than what we do now, but I don't think it's guaranteed that every replica ends up with the same bbolt layout? Also, on a hash failure, recovery is very complicated, no?

@serathius serathius added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Dec 7, 2022
@jmhbnz jmhbnz added priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. and removed priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Mar 18, 2024