
Lease checkpoints fix #13491

Closed

Conversation

michaljasionowski (Contributor)

This PR does 2 things:

  • fixes a bug that stops checkpoint scheduling after the first leader change
  • extends checkpointing to server restarts (the original implementation targeted only leader changes)

Lease checkpointing is a mechanism (currently turned off by default) that prevents lease TTLs from being reset to their initial values after each leader change.
Currently, checkpoints are scheduled only until the first leader change; after any subsequent leader change, TTLs are reset to their values from before the first leader change. To fix this, checkpoint scheduling is now forced for all leases after a leader change.
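
For context, a minimal sketch of where the first fix lands (the added call is the one visible in the diff reviewed further down; the surrounding Promote loop is paraphrased and abbreviated):

    // Promote runs on the member that becomes leader. Sketch: besides
    // refreshing expiries, force checkpoint scheduling for every lease so
    // checkpointing keeps working after the second and later leader changes.
    func (le *lessor) Promote(extend time.Duration) {
        le.mu.Lock()
        defer le.mu.Unlock()

        // ... (demote channel setup omitted)

        // refresh the expiry of all leases
        for _, l := range le.leaseMap {
            l.refresh(extend)
            item := &LeaseWithTime{id: l.ID, time: l.expiry}
            le.leaseExpiredNotifier.RegisterOrUpdate(item)
            le.scheduleCheckpointIfNeeded(l) // the added call
        }
        // ... (expiry rate-limiting for large numbers of leases is unchanged)
    }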

After fixing the first bug, lease checkpointing still stops working after a cluster restart, or in any other situation where server state has to be restored from the raft log. This is fixed by forcing the etcd server to apply all LeaseCheckpoint requests it processes, which, in the case of a server restart, includes LeaseCheckpoint requests that were already applied before the restart.
Another possible solution to this problem would be to store scheduled checkpoints in the KV store.
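
Roughly, the restart fix amounts to overriding the "already applied" check for LeaseCheckpoint entries in the apply path (a sketch based on the diff discussed further down and on the reviewer's mention of membership.ApplyBoth; the exact wiring is an assumption):

    // In applyEntryNormal (sketch): LeaseCheckpoint only touches in-memory
    // remaining TTLs, so re-applying it after a restart is harmless and is
    // the only way to recover checkpointed TTLs from the replayed raft log.
    if !shouldApplyV3 && raftReq.LeaseCheckpoint != nil {
        shouldApplyV3 = membership.ApplyBoth
    }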

Integration tests have been improved to cover both issues.

Current checkpointing mechanism is buggy: new checkpoints for any lease are scheduled only until the first leader change. Added a fix for that and a test that checks it.

Extend the lease checkpointing mechanism to cases when the whole etcd cluster is restarted. If an etcd server has to restore its state from the raft log, all LeaseCheckpoint requests will be applied to the server, regardless of their index value. This sets remaining TTLs to the values from before the restart; otherwise, remaining TTLs would be reset to the initial TTLs after each cluster restart.
Added an integration test to cover this case.
@serathius (Member)

cc @ptabor @jpbetz

@serathius self-requested a review November 22, 2021 09:47
@serathius (Member)

cc @hexfusion

@serathius (Member)

Your PR includes 3 merge commits that unnecessarily complicate the git commit history; can you clean them up? (Happy to help if you don't know how.)

@serathius (Member) commented Nov 22, 2021

After fixing the first bug, lease checkpointing still stops working after a cluster restart, or any situation when server state has to be restored from the raft log. This is fixed by forcing etcd server to apply all LeaseCheckpoint requests it processes, which in the case of server restart means also LeaseCheckpoint requests that were already applied before the restart.
Another possible solution to this problem would be to store scheduled checkpoints in the KV store.

One problem I see is inconsistent behavior depending on how much of the raft log is replayed. Checkpoints only affect state stored in memory and are not persisted. This means that the resulting TTL will depend on how much of the log is replayed. The etcd raft log is replayed from the last snapshot, which is triggered every 10,000 entries (by default; it can be changed), so whether any checkpoint (done every 5 minutes) falls within the replayed log depends on how many proposals per second the cluster handles.
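
As a rough illustration of that dependence: with the defaults above, a cluster handling ~100 proposals/s snapshots about every 100 s, so the replayed window is shorter than the 5-minute checkpoint interval and may contain no LeaseCheckpoint at all; at ~10 proposals/s the window is ~1,000 s and usually contains one. Whether remaining TTLs survive a restart therefore varies with write load.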

The change forcing V3 apply for Checkpoint doesn't really solve the problem. ApplyBoth here only makes etcd consider raft logs since the last snapshot (every 10,000 entries) instead of since the last commit (every 5 seconds). I bring this up because we are planning to remove the v2 store, and then we will no longer need to replay raft from the last snapshot.

Can you describe the scenarios where replaying from the raft log is needed? Which failure scenarios do we want to handle, and in which of them is replaying the raft log needed?

@@ -446,6 +446,7 @@ func (le *lessor) Promote(extend time.Duration) {
l.refresh(extend)
item := &LeaseWithTime{id: l.ID, time: l.expiry}
le.leaseExpiredNotifier.RegisterOrUpdate(item)
le.scheduleCheckpointIfNeeded(l)
Member:
Found that commenting out this line doesn't break the tests. Can you add a test for this?

Contributor:
Yes. I wonder why `le.scheduleCheckpointIfNeeded(l)` in line 487 is not handling this case.
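
For illustration, a rough sketch of the kind of unit test being asked for, written as if it lived in the lease package's own test file (the newLessor constructor arguments and the backend test helper are assumptions and may differ between branches):

    // Sketch: grant a lease whose TTL exceeds the checkpoint interval, then
    // call Promote (as on leader change). The test passes only if the lessor
    // asks the (fake) checkpointer to checkpoint the lease.
    func TestLessorCheckpointScheduledOnPromote(t *testing.T) {
        lg := zap.NewNop()
        be, _ := betesting.NewDefaultTmpBackend(t) // assumed test backend helper
        defer be.Close()

        le := newLessor(lg, be, LessorConfig{
            MinLeaseTTL:        1,
            CheckpointInterval: 1 * time.Second,
        })
        defer le.Stop()

        checkpointed := make(chan struct{}, 1)
        le.SetCheckpointer(func(ctx context.Context, cp *pb.LeaseCheckpointRequest) {
            select {
            case checkpointed <- struct{}{}:
            default:
            }
        })

        if _, err := le.Grant(1, 120); err != nil {
            t.Fatal(err)
        }
        le.Promote(0)

        select {
        case <-checkpointed:
        case <-time.After(3 * time.Second):
            t.Fatal("expected a checkpoint to be scheduled after Promote")
        }
    }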

@@ -1901,6 +1901,11 @@ func (s *EtcdServer) applyEntryNormal(e *raftpb.Entry) {
s.w.Trigger(r.ID, s.applyV2Request((*RequestV2)(rp), shouldApplyV3))
return
}
if !shouldApplyV3 && raftReq.LeaseCheckpoint != nil {
@serathius (Member) Nov 22, 2021:
This makes checkpoints depend on whether there was a checkpoint since the last snapshot. I think we should consider persisting remainingTTL from checkpoints into the backend.
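
A rough sketch of what persisting remainingTTL could look like in the lessor's Checkpoint path (the persistTo helper and the exact call site are assumptions; the leasepb.Lease record with its RemainingTTL field is the one quoted in the next comment):

    // Sketch: when a LeaseCheckpoint entry is applied, write the remaining TTL
    // into the lease's backend record instead of only updating memory, so a
    // restarted member recovers it without depending on raft log replay.
    func (le *lessor) Checkpoint(id LeaseID, remainingTTL int64) error {
        le.mu.Lock()
        defer le.mu.Unlock()

        if l, ok := le.leaseMap[id]; ok {
            // Promote remains responsible for turning remainingTTL into an expiry.
            l.remainingTTL = remainingTTL
            l.persistTo(le.b) // assumed helper: rewrites leasepb.Lease{ID, TTL, RemainingTTL}
        }
        return nil
    }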

@ptabor (Contributor) commented Nov 22, 2021

I find the current layout scary. A single object (the lessor) has parts of its state:

  1. updated pre-raft

  2. updated post-raft but in-memory (Expiration TTL). Raft is really a 'networking mechanism', not part of the consensus.

  3. updated post-raft in the V3 backend (the Lease object written by Grant:

    lpb := leasepb.Lease{ID: int64(l.ID), TTL: l.ttl, RemainingTTL: l.remainingTTL}
    )

And there is no protection ensuring that the post-raft code does not depend on the pre-raft code, which risks introducing non-determinism.

IMHO we should split the lessor into 2-3 explicit objects, where 2. & 3. can only be mutated post-raft and do not depend on 1. in any way. In the same way, 3. cannot depend on 2. (as 2. is currently non-deterministic).
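
Purely as an illustration of that split, a Go sketch (all type and method names hypothetical):

    // 1. Pre-raft view: consulted before proposing; allowed to be approximate
    // and non-deterministic across members.
    type leaseProposerView interface {
        RemainingTTL(id LeaseID) (ttl int64, ok bool)
    }

    // 2. Post-raft, in-memory expiration state: mutated only by the apply path;
    // must not depend on 1.
    type leaseExpiryState interface {
        Refresh(id LeaseID, extend time.Duration)
        Checkpoint(id LeaseID, remainingTTL int64)
    }

    // 3. Post-raft, persisted state: the leasepb.Lease records in the backend;
    // must not depend on 1. or 2.
    type leaseStore interface {
        Put(l leasepb.Lease)
        Delete(id LeaseID)
    }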


Looking at @jpbetz's goal (#9924), the intent was to minimize the number of raft proposals and writes performed during renewal. Without persisting the expiration TTL as part of the state, we end up with a fuzzy definition of 2.,
i.e. either the expiration TTL is correct or it's 0 (meaning unknown, assume the full TTL), and this state can differ between consensus members.


If I understand Marek's proposal, it would actually merge 2. & 3.:

  • the additional cost of updating the expiration TTL in bbolt should actually not be that significant

  • we would not persist leases in the storeV2 snapshot (as it does not contain leases at all currently)

  • there would be no guarantee (as currently) that the expiration TTL for a given lease is set... The guarantee would be that if it is set to a non-zero value, it is greater than the remaining time.

  • This seems good to me, but we should check @jpbetz's perspective.

  • I think we should split the lessor into 2 objects (pre- & post-raft)

  • It might be a good exercise to write a document (public markdown) that describes how leasing works with respect to raft and what the guarantees are.

@serathius added this to the etcd-v3.6 milestone Nov 25, 2021
@serathius (Member)

Closing for #13508
