Calculate hash during compaction #14049

Merged
merged 22 commits from compact-hash into main on Jun 14, 2022

Conversation

serathius
Member

@serathius serathius commented May 17, 2022

Implements low-cost hash calculation during compaction, as proposed in #14039

This PR was deliberately split into smaller commits to avoid any mistakes in the code. I recommend reviewing them one by one.

To make sure that the hash calculated by hashKV and by scheduleCompaction is the same, I needed to change the API so that a hash is always accompanied by the revision range it was calculated on. This is a change from the previous API, which always returned the latest revision (WAT?).

I also introduced intensive tests for the hash function in the first commit, to make sure I do not change the result of the hashing function.

This PR is big even though it's meant to be backported to v3.5. This is by design, as I don't think we could make a quick, hacky change without making a mistake. I included some refactors that should make the code easier to understand.

cc @ahrtr @ptabor
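
For context, a hedged sketch (in Go) of what a hash result carrying its revision range could look like; the type and field names below are assumptions, not necessarily the exact code in this PR:

// Illustrative only: a hash that records the revision range it was
// computed over, so results from hashKV and scheduleCompaction can be
// compared like-for-like.
type KeyValueHash struct {
	Hash            uint32 // hash of the key-value data in the range
	CompactRevision int64  // lower bound of the hashed range
	Revision        int64  // upper bound of the hashed range
}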

@codecov-commenter

codecov-commenter commented May 17, 2022

Codecov Report

Merging #14049 (735faeb) into main (908faa4) will decrease coverage by 0.15%.
The diff coverage is 85.24%.

@@            Coverage Diff             @@
##             main   #14049      +/-   ##
==========================================
- Coverage   75.20%   75.05%   -0.16%     
==========================================
  Files         451      453       +2     
  Lines       36791    36873      +82     
==========================================
+ Hits        27669    27675       +6     
- Misses       7392     7446      +54     
- Partials     1730     1752      +22     
Flag Coverage Δ
all 75.05% <85.24%> (-0.16%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
server/storage/mvcc/kv.go 40.00% <ø> (ø)
server/etcdserver/corrupt.go 56.47% <40.00%> (ø)
server/storage/mvcc/kvstore.go 87.58% <87.50%> (-0.73%) ⬇️
server/storage/mvcc/kvstore_compaction.go 95.65% <92.85%> (+0.78%) ⬆️
server/storage/mvcc/testutil/hash.go 95.74% <95.74%> (ø)
server/etcdserver/api/v3rpc/maintenance.go 75.34% <100.00%> (ø)
server/storage/mvcc/hash.go 100.00% <100.00%> (ø)
server/auth/simple_token.go 80.00% <0.00%> (-8.47%) ⬇️
server/etcdserver/api/v3lock/lock.go 61.53% <0.00%> (-8.47%) ⬇️
server/etcdserver/api/v2stats/queue.go 20.83% <0.00%> (-7.38%) ⬇️
... and 29 more

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 908faa4...735faeb.

server/storage/mvcc/kvstore_compaction.go (outdated review thread, resolved)
server/storage/mvcc/kvstore.go (outdated review thread, resolved)
@ahrtr
Member

ahrtr commented May 17, 2022

Based on a quick review, it should introduce no performance loss. I will do a deeper review when the PR is marked as ready for review.

@serathius serathius force-pushed the compact-hash branch 7 times, most recently from cfbb49f to a46abfc Compare May 19, 2022 16:19
@serathius serathius changed the title [WIP] Calculate hash during compaction Calculate hash during compaction May 19, 2022
@serathius serathius marked this pull request as ready for review May 19, 2022 16:35
@serathius
Member Author

Pushing the PR for review. cc @ahrtr @ptabor
This one only implements hash calculation; I will prepare a separate PR for validating hashes.

server/storage/mvcc/kvstore_compaction.go (outdated review thread, resolved)
-	keys, _ := tx.UnsafeRange(schema.Key, last, end, int64(batchNum))
+	keys, values := tx.UnsafeRange(schema.Key, last, end, int64(batchNum))
 	for _, key := range keys {
 		rev = bytesToRev(key)
Contributor

Please see that end is defined as:

binary.BigEndian.PutUint64(end, uint64(compactMainRev+1))

I'm afraid this logic would miss all entries created after the most recent compaction... but maybe I'm missing something...

Member Author

compactMainRev is the revision that is being compacted. This is our goal: calculate the hash at the same time as the compaction is being done, and thus avoid increasing the number of times each entry is touched. We don't care about entries after the most recent compaction.
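
A hedged sketch of the idea under discussion: compute the hash inside the same loop compaction already uses to walk the key bucket up to compactMainRev, so no entry is read an extra time. It assumes the mvcc package context (schema.Key, bytesToRev, revision) and simplifies away batching; the helper structure and the choice of hash function are assumptions, not the PR's exact code.

// Sketch only: hash while compacting, in one pass over the key bucket.
func compactAndHash(tx backend.BatchTx, keep map[revision]struct{}, compactMainRev int64) uint32 {
	h := crc32.New(crc32.MakeTable(crc32.Castagnoli)) // illustrative hash function

	end := make([]byte, 8)
	binary.BigEndian.PutUint64(end, uint64(compactMainRev+1))

	// The same range the compaction loop already iterates over.
	keys, values := tx.UnsafeRange(schema.Key, []byte{0}, end, -1)
	for i, key := range keys {
		// Fold every visited entry into the hash while it is in memory anyway.
		h.Write(key)
		h.Write(values[i])
		// Compaction behaviour is unchanged: drop revisions not in keep.
		if _, ok := keep[bytesToRev(key)]; !ok {
			tx.UnsafeDelete(schema.Key, key)
		}
	}
	return h.Sum32()
}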

@@ -22,7 +22,7 @@ import (
"go.etcd.io/etcd/server/v3/storage/schema"
)

-func unsafeHashByRev(tx backend.ReadTx, lower, upper revision, keep map[revision]struct{}) (uint32, error) {
+func unsafeHashByRev(tx backend.ReadTx, lower, upper int64, keep map[revision]struct{}) (uint32, error) {
Contributor

I wish revision was strongly typed:

type Revision int64, such that it's hard to mistype between int and Revision.

Probably for another refactoring.
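
A minimal sketch of the suggested refactoring, purely illustrative and not part of this PR:

// A distinct type makes it hard to accidentally pass a plain int64
// (for example a sub revision or a count) where a main revision is expected.
type Revision int64

// Callers must convert explicitly, which surfaces mix-ups at compile time:
//   unsafeHashByRev(tx, Revision(compactRev), Revision(currentRev), keep)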

Member Author

That's a good point. Here I wanted to stop using the revision type to ensure that only the main revision is set. The revision type allows setting a sub-revision, which could lead to hard-to-predict side effects.

@ahrtr
Member

ahrtr commented May 20, 2022

I will do a deep review once it's rebased to resolve the conflict.

@serathius
Member Author

Done

server/etcdserver/api/v3rpc/maintenance.go (outdated review thread, resolved)
server/storage/mvcc/kvstore.go (review thread, resolved)
server/storage/mvcc/kvstore_compaction.go (outdated review thread, resolved)
server/storage/mvcc/hash.go (review thread, resolved)
server/etcdserver/corrupt.go (outdated review thread, resolved)
server/etcdserver/corrupt.go (outdated review thread, resolved)
server/etcdserver/corrupt.go (outdated review thread, resolved)
@serathius serathius force-pushed the compact-hash branch 3 times, most recently from f995f8b to 8f6924c Compare June 7, 2022 14:39
@serathius
Member Author

Added the changes needed to expose the compact hash in the HashByRev function, and implemented integration tests.
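
A hedged sketch of how HashByRev could serve a stored compaction hash instead of recomputing it. The method and field names (hashes, hashMu, hashByRevCompute, KeyValueHash) are assumptions for illustration, not the exact code in this PR:

// Sketch: return a previously stored compaction hash when the requested
// revision matches one; otherwise fall back to computing it from the keyspace.
func (s *store) HashByRev(rev int64) (KeyValueHash, error) {
	s.hashMu.RLock()
	for _, h := range s.hashes {
		if h.Revision == rev {
			s.hashMu.RUnlock()
			return h, nil
		}
	}
	s.hashMu.RUnlock()
	return s.hashByRevCompute(rev) // hypothetical slow path over the keyspace
}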

server/storage/mvcc/hash_test.go (review thread, resolved)
server/storage/mvcc/hash_test.go (review thread, resolved)
	s.hashMu.Lock()
	defer s.hashMu.Unlock()
	s.hashes = append(s.hashes, hash)
Member

What if a member gets restarted? All hashes will get lost. Should we persist the hashes into the db?

Member Author

A fully correct solution would persist hashes into WAL entries; however, we cannot backport such a change. You are right that not persisting hashes means restarts are disruptive, but I prefer to accept this flaw rather than develop a temporary solution.

Not every mechanism in etcd works well with restarts. The consistency of the KV store definitely needs to be maintained independent of restarts, but secondary mechanisms like leases will not work if etcd is restarted frequently. It's up to the admin to ensure that etcd is not restarted too frequently so that leases work well. I think it's reasonable to assume that etcd doesn't restart in order for non-critical features to work.

For this v3.5 backport I would treat corruption checking as a secondary mechanism that requires etcd not to restart; for v3.6 I would look into making the mechanism fully independent of restarts.
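
The snippet above suggests the hashes are kept only in memory. A hedged sketch of such storage, where the field names come from the snippet but the method name, the KeyValueHash type, and the history bound are assumptions:

// Sketch: in-memory storage of compaction hashes, lost on restart
// (the trade-off discussed in this thread). The size limit is illustrative.
const maxStoredHashes = 10

type store struct {
	hashMu sync.RWMutex
	hashes []KeyValueHash // hashes of recent compactions, oldest first
}

func (s *store) hashStore(hash KeyValueHash) {
	s.hashMu.Lock()
	defer s.hashMu.Unlock()
	s.hashes = append(s.hashes, hash)
	// Bound the history so memory use stays constant across compactions.
	if len(s.hashes) > maxStoredHashes {
		s.hashes = s.hashes[len(s.hashes)-maxStoredHashes:]
	}
}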

Member

Will think about this and get back later.

A fully correct solution would persist hashes into WAL entries

Not sure what you mean. Do you mean you want to depend on WAL replay at startup to recover the hashes?

Member Author

@serathius serathius Jun 8, 2022

I mean that we would introduce a new WAL entry carrying a hash. The leader would periodically calculate its hash and send it as a raft proposal. When applying the entry, followers would verify whether their hash matches the one in the raft entry.

That would be a mid-term solution before implementing the merkle root (#13839).
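
A hedged sketch of that mid-term idea; the proposal type, field names, and apply hook are illustrative and not an existing etcd API:

// Illustrative only: a hash the leader replicates through raft, and the
// check a follower could run while applying the entry.
type HashProposal struct {
	CompactRevision int64  // hash covers revisions in (CompactRevision, Revision]
	Revision        int64
	Hash            uint32 // leader's hash for that range
}

func applyHashProposal(localHash uint32, p HashProposal) error {
	// Because the proposal is ordered in the WAL relative to KV writes,
	// every member evaluates it at the same point in history, so drift
	// between members cannot cause a false mismatch.
	if localHash != p.Hash {
		return fmt.Errorf("hash mismatch at revision %d: leader=%x, local=%x",
			p.Revision, p.Hash, localHash)
	}
	return nil
}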

Member

It can work, but it doesn't seem to make much sense. Raft is just a network transport channel in this case instead of a consensus protocol: the leader merely transports the hash to the followers. Why not use the existing peer communication mechanism?

But it might be useful in the solution that implements the merkle root?

Member Author

The idea to use raft to deliver the hash is based on the fact that there is no other way to guarantee member synchronization. The main flaw of the current system is drift between members, which makes it unreliable to query them about a particular revision. The leader can never check the hash on very slow followers, as they will not have the requested revision. This effort works around that by keeping some history of hashes, allowing the leader to query about them more reliably.

The main reason to verify the hash via raft is that it would be in the WAL log. That allows the hash check to be part of the sequential WAL entry apply logic, which means it will be ordered relative to KV changes. Benefits:

  • hash check execution is guaranteed to succeed (not influenced by drift)
  • the hash check is performed on the latest revision

The only downside is that calculating the hash is costly, so it will not make sense to run it frequently (only once every couple of seconds/minutes, depending on the cluster backup setup). For immediate hash verification on every entry we would need a merkle root, allowing automatic, point-in-time recovery from data corruption.

Member

For immediate hash verification on every entry we would need a merkle root, allowing automatic, point-in-time recovery from data corruption.

The historical data has already been compacted, so how can you perform point-in-time recovery from data corruption? It seems that we can only recover the data from a backup of the db files. I assume we need to deliver a separate tool, or new commands in etcdctl or etcdutl, to do that in the future. Please let me know if I missed anything.

Member

It seems that we need to get this sorted out before merging this PR.

@serathius serathius force-pushed the compact-hash branch 2 times, most recently from c2ae790 to e321717 Compare June 8, 2022 11:24
@serathius
Member Author

As the corruption code is super delicate and I didn't want to take any chances, I added a set of unit tests to cover all branches of the code. All other changes are just a rebase on top of the refactor needed for the unit tests.

Commits (22), each signed off by Marek Siarkowicz <siarkowicz@google.com>. Surviving commit message bodies:

Get 100% coverage on InitialCheck and PeriodicCheck functions to avoid any mistakes.

Makes it easier to test hash match between scheduleCompaction and HashByRev.

… Defrag
@serathius
Member Author

Plan to backport once the whole solution is merged into the main branch.

@serathius serathius mentioned this pull request Jun 21, 2022
16 tasks
@serathius serathius deleted the compact-hash branch June 15, 2023 20:37
4 participants