
Proposals should include a merkle root #13839

Open
lavalamp opened this issue Mar 25, 2022 · 21 comments
Labels
priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. stage/tracked

Comments

@lavalamp

lavalamp commented Mar 25, 2022

In response to #13766 and past issues (e.g. #11613) that get into the same condition:

It is possible for etcd to get into a condition where the databases on different replicas have different contents, and then proceed to commit changes. There have been other routes to this in the past. I don't think it's possible to prevent all db corruption issues, but it is certainly possible to make etcd stop and recover when the DB doesn't match.

The reason etcd can proceed is that the protocol has replicas agree on the contents of changes but not on the state of the db after applying the change. The easiest fix for this is to adjust things to compute a hash covering the entire db. This can be done efficiently (log N hashing operations per db change) by using a merkle tree. If proposed changes also included a proposed new merkle root hash, a replica with a differing db would not be able to accept a change, and this condition would be caught the instant it happened.

(And it's recoverable at that point by reading the correct state from the other replicas. Moreover, depending on how you implement it, the merkle tree could be examined a layer at a time to find what is corrupted instead of needing to copy the entire state, which could be large.)

The merkle tree technique is very common in cryptocurrencies. etcd is basically a cryptocurrency with no token and a non-byzantine-resistant coordination mechanism. These are both correct tradeoffs given what etcd does, but continuing to accept changes to a corrupted state is extremely bad -- any cryptocurrency with a corresponding bug would totally go to zero.
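To make the idea concrete, here is a minimal Go sketch of a proposal carrying its expected post-apply root. Everything here is hypothetical (`Proposal`, `apply`, and `rootHash` are not etcd types), and the full-scan hash stands in for the real merkle tree:

```go
package sketch

import (
	"bytes"
	"crypto/sha256"
	"fmt"
	"sort"
)

// Proposal is a hypothetical raft entry that carries, alongside the change
// itself, the merkle root the proposer expects the whole db to have after
// the change is applied.
type Proposal struct {
	Key, Value   []byte
	ExpectedRoot [32]byte
}

// apply applies the change, then rejects it if this replica's post-apply
// root differs from the proposer's, so a diverged replica is caught the
// instant it happens instead of silently committing on top of bad state.
// (A real implementation would do the check inside the apply transaction
// and roll back on mismatch.)
func apply(db map[string][]byte, p Proposal) error {
	db[string(p.Key)] = p.Value
	root := rootHash(db)
	if !bytes.Equal(root[:], p.ExpectedRoot[:]) {
		return fmt.Errorf("state divergence: got root %x, proposer expected %x",
			root, p.ExpectedRoot)
	}
	return nil
}

// rootHash is a stand-in that scans the full db in sorted-key order; the
// point of a merkle tree is to compute the same kind of root incrementally,
// with O(log N) hashing operations per change.
func rootHash(db map[string][]byte) [32]byte {
	keys := make([]string, 0, len(db))
	for k := range db {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	h := sha256.New()
	for _, k := range keys {
		h.Write([]byte(k))
		h.Write(db[k])
	}
	var out [32]byte
	copy(out[:], h.Sum(nil))
	return out
}
```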

@lavalamp
Author

cc @jpbetz, whom I've tried to sell on this in the past :)

@lavalamp
Author

Looks like Joe previously proposed this in #10893, but that went stale :(

@serathius
Member

Yep, I pointed out the problem of this issue going stale in #13775 and wanted to propose implementing #10893 as part of the graduation of the corruption check in #9190.

Let's avoid tracking the same issue in multiple places. @lavalamp do you want to propose implementing merkle trees as graduation criteria in #9190 and close this one?

@lavalamp
Author

That sounds great to me! How do I do that?

@ptabor
Contributor

ptabor commented Mar 25, 2022

+1.

I see 3 options:
a) The merkle tree is integrated into bbolt, i.e. we can reliably ask bbolt for its checksum at each transactional state. This would guarantee bbolt-level consistency and is probably the strongest form of physical consistency we can get... but it requires changing the data-storage format and might have the biggest performance impact.

b) We somehow arbitrarily partition the etcd key-space into a merkle tree. It's hard to represent the tree structure of keys, as etcd is agnostic to it. For debugging it would be nice to preserve key-space continuity (i.e. neighbouring sorted keys frequently sharing the same merkle-tree node), but that seems difficult. If we hash each key individually and build the merkle tree on the hashes of the keys (e.g. bit-groups defining the node structure), this would work, but it would not help isolate the problem for debugging.

c) We don't need a merkle tree. We maintain a hash of the snapshot plus a chain of hashes over all the proposals that get applied at the MVCC layer. This would guarantee determinism of the actions performed on MVCC, but an inconsistency could still originate from inside the MVCC implementation.
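For concreteness, a minimal Go sketch of option (c), with hypothetical names: the snapshot hash seeds a chain, and each proposal applied at the MVCC layer extends it.

```go
package sketch

import "crypto/sha256"

// ApplyChain maintains option (c)'s running hash: it starts from the hash
// of the snapshot and folds in each proposal as it is applied. Replicas
// applying the same proposals in the same order must agree on the chain
// head; what the chain cannot see is corruption arising inside the MVCC
// implementation after a proposal has been folded in.
type ApplyChain struct {
	head [32]byte
}

func NewApplyChain(snapshotHash [32]byte) *ApplyChain {
	return &ApplyChain{head: snapshotHash}
}

// Fold extends the chain: head' = SHA-256(head || proposal bytes).
func (c *ApplyChain) Fold(proposal []byte) [32]byte {
	h := sha256.New()
	h.Write(c.head[:])
	h.Write(proposal)
	copy(c.head[:], h.Sum(nil))
	return c.head
}
```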

@serathius
Member

That sounds great to me! How do I do that?

Just leave a comment on #9190

@lavalamp
Author

re: a): I do not think it is a good idea to do this in the storage layer, since it contains a bunch of state that is irrelevant to the state of the db at a given revision (e.g., it depends on what has been compacted). We should not make the etcd replicas' hash computation depend on anything historical; it should be stateless, purely a function of the db state at a given revision. That permits, e.g., efficiently un-corrupting a replica: it arrives at the correct state, but by an unconventional path.

re: b): I expect the easiest thing to do is a two-step process:

  1. compute a key || value hash for every key
  2. put those in a separate patricia-merkle tree (e.g. from memory, bucket by prefix, hash each bucket, make a merkle tree from the bucket hashes; this gives the right properties)

Given an existing such structure and a hypothetical operation, it's easyish to figure out an expected root hash.
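A rough Go sketch of that two-step structure, flattened to two levels for brevity (leaf hashes bucketed by the first key byte, then a root over the bucket hashes); names are illustrative, and a real patricia-merkle tree would recurse on longer prefixes:

```go
package sketch

import (
	"crypto/sha256"
	"sort"
)

// leafHash is step 1: one hash covering key || value, per key.
func leafHash(key, value []byte) [32]byte {
	h := sha256.New()
	h.Write(key)
	h.Write(value)
	var out [32]byte
	copy(out[:], h.Sum(nil))
	return out
}

// bucketedRoot is step 2: bucket the leaf hashes by the first byte of the
// key, hash each bucket, then hash the bucket hashes together. Comparing
// bucket hashes between replicas narrows corruption down to one bucket
// without copying the entire state.
func bucketedRoot(kvs map[string][]byte) [32]byte {
	keys := make([]string, 0, len(kvs))
	for k := range kvs {
		keys = append(keys, k)
	}
	sort.Strings(keys) // deterministic leaf order within each bucket

	var buckets [256][][32]byte
	for _, k := range keys {
		var prefix byte
		if len(k) > 0 {
			prefix = k[0]
		}
		buckets[prefix] = append(buckets[prefix], leafHash([]byte(k), kvs[k]))
	}

	root := sha256.New()
	for _, leaves := range buckets { // deterministic bucket order
		if len(leaves) == 0 {
			continue
		}
		bh := sha256.New()
		for _, l := range leaves {
			bh.Write(l[:])
		}
		root.Write(bh.Sum(nil))
	}
	var out [32]byte
	copy(out[:], root.Sum(nil))
	return out
}
```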

re: c): AFAICT if you do it that way, there's no way to efficiently un-corrupt the database. Additionally, even if you have a correct snapshot and a correct replay log, you can have a bug where a transaction gets applied wrong, and that doesn't get detected until the next snapshot is checked. This also requires all etcd replicas to take snapshots at the exact same time; I think that shouldn't be part of the state for correctness checking.

@lavalamp
Author

OK I commented there, if that's sufficient we can close this.

@serathius
Member

serathius commented Mar 25, 2022

Based on the comments above, there are non-trivial decisions to be made; let's keep the issue open and continue the discussion on the design.

@xiang90
Contributor

xiang90 commented Mar 27, 2022

@ptabor @serathius

Thought about this a long time ago. Never got time to implement though :P Hope it still helps.

Step 1

  • Enable the hash check at startup time. A full state scan is always the least error-prone and safest option compared to incremental approaches (a merkle tree or snapshot+lineage, for example). Delaying startup seems OK from a performance/latency perspective.

Step 2

  • Store and piggyback the consistent index for runtime checking. We can view the consistent index as a special incremental hash of the MVCC state, with false negatives. If different members have different consistent indexes for the same applied index, then we know there must be an issue (see the sketch at the end of this comment). I assume this approach will detect 95% of issues. It will also identify when the inconsistency happened.

Step 3

  • Implement a merkle tree or other tree-based incremental hash check at the backend layer, with low/no false negatives. The backend is a layer above the bbolt database and a layer below MVCC. That is where things (state changes) converge into a smaller set of APIs and we still have full control inside etcd (we see bbolt as an external dependency). The implementation should be a side-channel in-memory structure.

Some follow-up steps are needed to enable the incremental hash in a safe way. Initially, it should be just a checking mechanism, and should not stop the cluster from functioning. As we build more confidence in this hash checking, we can start to rely on it to pull the hand brake (stop the cluster), etc.
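As an illustration of the step 2 comparison (with hypothetical types, not etcd's actual API):

```go
package sketch

import "fmt"

// memberReport is a hypothetical per-member report: the raft index the
// member has applied and the consistent index its store records for it.
type memberReport struct {
	Name            string
	AppliedIndex    uint64
	ConsistentIndex uint64
}

// checkConsistentIndexes flags divergence when two members disagree on the
// consistent index for the same applied index. Agreement proves nothing
// (false negatives: same indexes, different data), but disagreement is a
// sure signal, and the applied index pinpoints when it happened.
func checkConsistentIndexes(reports []memberReport) error {
	seen := make(map[uint64]memberReport)
	for _, r := range reports {
		if prev, ok := seen[r.AppliedIndex]; ok && prev.ConsistentIndex != r.ConsistentIndex {
			return fmt.Errorf("divergence at applied index %d: %s has consistent index %d, %s has %d",
				r.AppliedIndex, prev.Name, prev.ConsistentIndex, r.Name, r.ConsistentIndex)
		}
		seen[r.AppliedIndex] = r
	}
	return nil
}
```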

@xiang90
Contributor

xiang90 commented Mar 27, 2022

A related topic - we should keep the practice of running failure injection tests before any minor releases. (maybe we still do this today?) We used to run at least 3 clusters for about 1 month. It almost always catches inconsistency bugs :P.

@jpbetz
Contributor

jpbetz commented Mar 28, 2022

A related topic - we should keep the practice of running failure injection tests before any minor releases. (maybe we still do this today?) We used to run at least 3 clusters for about 1 month. It almost always catches inconsistency bugs :P.

+1. I vaguely remember that after one of the inconsistency bugs we found a few years back, we improved the injection testing to check for failures around some of the restart functionality, as a way of trying to prevent future occurrences of that class of issue.

@serathius
Member

+1 for more failure injection testing. Unfortunately, depending on manual testing leads to inconsistency in release qualification; we need to invest more in automation. I'm planning to write a public postmortem that will go into the actions we should take to prevent such issues in the future.

@shalinmangar

How about automated testing based on Jepsen? Is that something we run regularly today, or before a release? I have some experience with it, so I can help set it up.

@lavalamp
Author

lavalamp commented Apr 1, 2022

Automated testing is good, but the problem is not just etcd bugs that fall outside our hypothesis space (which testing is already iffy at finding) -- it's actually not even sufficient for the etcd code to be 100% correct. I'm thinking in particular of RAM-, disk-, or network-based corruption. The real task is to have the replicas arrive at the same state in the face of this kind of error.

@MagicStarTrace

etcd: v3.5.2

I added it to the startup flags; how can I verify whether it takes effect (experimental-initial-corrupt-check)?


Does it mean that 3.5.3 does not need to add the parameter "--experimental-initial-corrupt-check"?

Thank you!

@stale

stale bot commented Jul 13, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Jul 13, 2022
@serathius
Member

This is proposed as a postmortem action item and planned for v3.7.

@lavalamp
Author

I've thought some more about this, and I think significant gains can be had just from storing a hash of each {key, value} pair, so that corruption of individual values can be detected before building on them. This should also be simpler, and it is a prerequisite for a global hash of some sort anyway.
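A minimal Go sketch of that per-pair checking, assuming a hypothetical store layout (not etcd's actual schema):

```go
package sketch

import (
	"crypto/sha256"
	"errors"
)

// hashedValue pairs a value with a checksum over key || value, computed at
// write time. (A real implementation would length-prefix the key to avoid
// concatenation ambiguity.)
type hashedValue struct {
	Value []byte
	Sum   [32]byte
}

func put(store map[string]hashedValue, key string, value []byte) {
	sum := sha256.Sum256(append([]byte(key), value...))
	store[key] = hashedValue{Value: value, Sum: sum}
}

// get re-checks the stored hash, so a corrupted value is detected the
// moment it is read, before any transaction builds on it; the caller can
// then fetch the correct value from another member.
func get(store map[string]hashedValue, key string) ([]byte, error) {
	hv, ok := store[key]
	if !ok {
		return nil, errors.New("not found")
	}
	if sha256.Sum256(append([]byte(key), hv.Value...)) != hv.Sum {
		return nil, errors.New("checksum mismatch: value corrupted")
	}
	return hv.Value, nil
}
```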

@serathius
Member

Instead of checking single KV pairs, I would look into adding hashes to bbolt pages. bbolt implements a B+ tree, which should be extensible to include hash values like a merkle tree, allowing a consistency check of the whole database.
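Roughly what that could look like, sketched on a toy B+-tree-style node rather than bbolt's actual page format (the structure below is an assumption for illustration):

```go
package sketch

import "crypto/sha256"

// node is a toy B+-tree-ish node: each branch covers its children's hashes,
// so the root hash covers the whole tree, merkle-style, and corruption in
// any page changes the root.
type node struct {
	leaf     bool
	keys     [][]byte
	values   [][]byte // set on leaves
	children []*node  // set on branches
}

func (n *node) hash() [32]byte {
	h := sha256.New()
	if n.leaf {
		for i, k := range n.keys {
			h.Write(k)
			h.Write(n.values[i])
		}
	} else {
		for _, c := range n.children {
			ch := c.hash()
			h.Write(ch[:])
		}
	}
	var out [32]byte
	copy(out[:], h.Sum(nil))
	return out
}
```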

@lavalamp
Author

lavalamp commented Sep 1, 2022

The benefits of hashing every KV pair:

  • you can check the hash when the KV is an input to a transaction
  • on finding a corrupt key, you can easily get the correct value from another member--recovery is very easy

Hashing bbolt pages is better than what we do now, but I don't think it's guaranteed that every replica ends up with the same bbolt layout? Also, on a hash failure, recovery is very complicated, no?

@serathius serathius added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Dec 7, 2022
@jmhbnz jmhbnz added priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. and removed priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Mar 18, 2024