
[Lease] Refactor lease renew request via raft #14094

Draft
ahrtr wants to merge 3 commits into main from lease_renew_refactor_20220607

Conversation

ahrtr (Member) commented Jun 7, 2022

Previously, the lease renew request could only be processed by the leader. When a follower received a renew request, it simply forwarded it to the leader via an internal HTTP channel. This isn't reliable, because the leader may change while the request is in flight.

When the leader receives a renew request, the previous implementation follows a three-stage workflow: pre-raft, raft, and post-raft. This is too complicated and error-prone, and raft acts more like a network transport channel than a consensus mechanism in this case. Please also see issuecomment-975817268.

So in this PR, we process the renew request via raft directly, which greatly simplifies the code.

The client-facing API is unchanged, so there is no impact on client applications, including Kubernetes.
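To make the intended shape of the change concrete, here is a minimal Go sketch of the propose-then-wait-for-apply pattern described above. Everything here (keepAliveRequest, proposeC, the fake apply loop) is a hypothetical stand-in, not etcd's actual types or plumbing; it only illustrates renewing through consensus instead of a leader-only code path.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

type LeaseID int64

type keepAliveRequest struct{ ID LeaseID }

type keepAliveResponse struct {
	ID  LeaseID
	TTL int64 // remaining TTL granted by the apply step
}

type server struct {
	proposeC chan keepAliveRequest  // stand-in for a raft proposal channel
	applyC   chan keepAliveResponse // stand-in for the applied-entry result
}

// LeaseRenew proposes the renew through raft and blocks until the entry is
// applied (or the context expires), so the result reflects consensus rather
// than the view of whichever node happened to be leader when the RPC arrived.
func (s *server) LeaseRenew(ctx context.Context, id LeaseID) (*keepAliveResponse, error) {
	select {
	case s.proposeC <- keepAliveRequest{ID: id}:
	case <-ctx.Done():
		return nil, ctx.Err()
	}
	select {
	case resp := <-s.applyC:
		return &resp, nil
	case <-ctx.Done():
		return nil, errors.New("renew not applied before deadline")
	}
}

func main() {
	s := &server{proposeC: make(chan keepAliveRequest, 1), applyC: make(chan keepAliveResponse, 1)}
	// Fake apply loop standing in for the raft state machine.
	go func() {
		req := <-s.proposeC
		s.applyC <- keepAliveResponse{ID: req.ID, TTL: 60}
	}()
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	resp, err := s.LeaseRenew(ctx, 1)
	fmt.Println(resp, err)
}
```

Because the response is produced by the apply step, any member can serve the RPC and the result stays valid even if leadership changes mid-flight.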

ahrtr (Member, Author) commented Jun 7, 2022

cc @serathius @ptabor @spzala @mitake

codecov-commenter commented Jun 7, 2022

Codecov Report

Merging #14094 (cb249cc) into main (4ce7a85) will decrease coverage by 0.04%.
The diff coverage is 90.19%.

@@            Coverage Diff             @@
##             main   #14094      +/-   ##
==========================================
- Coverage   75.08%   75.04%   -0.05%     
==========================================
  Files         452      452              
  Lines       36781    36826      +45     
==========================================
+ Hits        27618    27636      +18     
- Misses       7424     7447      +23     
- Partials     1739     1743       +4     
Flag   Coverage Δ
all    75.04% <90.19%> (-0.05%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
server/config/config.go 79.76% <ø> (ø)
server/embed/config.go 73.57% <ø> (ø)
server/etcdserver/apply/corrupt.go 17.64% <0.00%> (-2.36%) ⬇️
server/lease/lessor.go 87.81% <84.21%> (-0.99%) ⬇️
server/embed/etcd.go 75.09% <100.00%> (+0.19%) ⬆️
server/etcdmain/config.go 86.17% <100.00%> (+0.05%) ⬆️
server/etcdserver/api/v3rpc/lease.go 82.27% <100.00%> (ø)
server/etcdserver/apply/apply.go 82.64% <100.00%> (+0.24%) ⬆️
server/etcdserver/apply/uber_applier.go 91.48% <100.00%> (+0.18%) ⬆️
server/etcdserver/v3_server.go 75.76% <100.00%> (-2.25%) ⬇️
... and 35 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 4ce7a85...cb249cc.

serathius (Member) commented

The discussion of lease issues and proposed fixes is so spread out that I find myself unable to comprehend it all.

@ahrtr can you help me understand how this change relates to #13915 and what our overall plan is? It would be great to have one umbrella issue that lays out the high-level plan, so we can make sure we are going in the right direction.

I don't remember us ever discussing moving Grant to raft, as there were some performance concerns. With a lot of leases, grant requests can happen very frequently, and they are latency sensitive, making them much more vulnerable to network hiccups. Before we proceed with this PR, it would be good to dig up the original discussion and do some load testing. I haven't looked too deeply into this PR, so I might be wrong about the performance concerns.

ahrtr (Member, Author) commented Jun 8, 2022

This is a standalone refactor of Renew, not Grant.

It took me some time to go through some historical PRs (9924, 9526, 9699, 13508) before delivering this PR.

I thought about three solutions (see below) for refactoring leases, but eventually realized that none of them is feasible.

  1. Persist expiryTime instead of remainingTTL into the db, similar to what 9526 did. Then we wouldn't need the Checkpoint functionality at all: each member, leader and followers alike, calculates the expiryTime itself. Since different nodes may have inconsistent wall time, only the leader can check the expiry of each lease. The reason each follower also calculates the expiryTime is to prepare for the leader-change case: once a follower becomes the leader, it just uses its local expiryTime to continue serving the leases.

This solution would greatly simplify the overall design and implementation, because we wouldn't need the checkpoint functionality anymore. If a member's time jumps forward or backward drastically, we will run into issues no matter which solution we follow, but this solution would be even worse in that case: the existing implementation is only affected when the leader's time jumps, while this solution might run into issues if any member's time jumps. Of course, each revoke and renew can fix and reset the time jumps.

Another downside of this solution is that when the cluster is down for some time (such as during a maintenance window), all leases might be expired when the cluster starts again. The existing implementation does not have this issue, because it persists the remainingTTL instead of the expiryTime. The customer/client shouldn't pay for a server-side issue, so persisting remainingTTL makes more sense than persisting expiryTime from this perspective (see the sketch right after this list).

  2. Introduce a clusterClock concept, so that all members, leader and followers alike, depend on the clusterClock instead of their local wall time. In that case we wouldn't need the checkpoint either, which would also greatly simplify the overall design and implementation. But it isn't that easy to implement a clusterClock. :( Please refer to logcabin, and please let me know if you have any proposals. :)

  3. Introduce NTP (Network Time Protocol), but that would complicate the overall design, and it still couldn't guarantee consistent wall time between members.
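To make the trade-off in option 1 concrete, here is a small Go sketch (hypothetical types, not etcd code) of the maintenance-window scenario: a persisted expiryTime is already in the past when the cluster restarts, while a persisted remainingTTL is simply re-based on the new leader's clock.

```go
package main

import (
	"fmt"
	"time"
)

type persistedLease struct {
	RemainingTTL time.Duration // what the current design persists (via checkpoints)
	ExpiryTime   time.Time     // what the rejected alternative would persist
}

func main() {
	// The lease had 60s left when the cluster went down; the cluster comes
	// back after a 10-minute maintenance window.
	downAt := time.Now().Add(-10 * time.Minute)
	l := persistedLease{
		RemainingTTL: 60 * time.Second,
		ExpiryTime:   downAt.Add(60 * time.Second),
	}

	// remainingTTL: the expiry is recomputed from the current clock, so the
	// lease survives the outage.
	fromTTL := time.Now().Add(l.RemainingTTL)
	fmt.Println("remainingTTL-based expiry in future:", fromTTL.After(time.Now()))
	// expiryTime: the stored instant is already in the past, so the lease is dead.
	fmt.Println("expiryTime-based lease already expired:", l.ExpiryTime.Before(time.Now()))
}
```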

As for the overall plan, my thoughts are:

  1. Keep the current implementations of Grant, Revoke, and Checkpoint as they are for now. All of them go through raft, which should be OK. In particular, the checkpoint syncs remainingTTL to followers via raft, which is good to me. Please see the comment (remainingTTL vs expiryTime) above on the first infeasible solution.
  2. Refactor the lease renew request to go via raft. This is exactly what this PR does; please read the description of this PR.
  3. Modify LeaseLease/ListLease to make sure the member has applied the latest commit. We just need to use linearizableReadNotify (see the sketch after this comment). I think PR 13882 is the right direction.
  4. Regarding LeaseTimeToLive, let's keep it as it is for now, because we can only depend on the leader's local wall time; the followers' local time may be inconsistent.

So in summary, we only need to refactor renew (this PR) and modify ListLease (13882). They are two separate and independent changes.
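For item 3, the idea is a read barrier before serving the local lease list. A minimal Go sketch, assuming a hypothetical server type; only the linearizableReadNotify name comes from the discussion above, the rest is illustrative:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

type server struct {
	readNotify chan struct{} // closed once the apply index has caught up
}

// linearizableReadNotify blocks until the member has applied everything
// committed up to the read's registration (or the context expires).
func (s *server) linearizableReadNotify(ctx context.Context) error {
	select {
	case <-s.readNotify:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func (s *server) ListLeases(ctx context.Context) ([]int64, error) {
	// Barrier first, then read the (now up-to-date) local lessor state.
	if err := s.linearizableReadNotify(ctx); err != nil {
		return nil, err
	}
	return []int64{1, 2, 3}, nil // stand-in for the local lease list
}

func main() {
	s := &server{readNotify: make(chan struct{})}
	go func() { time.Sleep(10 * time.Millisecond); close(s.readNotify) }()
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	fmt.Println(s.ListLeases(ctx))
}
```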

ahrtr force-pushed the lease_renew_refactor_20220607 branch from 49dcdc9 to 96bf559 on June 10, 2022 02:49
ahrtr marked this pull request as draft on June 10, 2022 02:56
ahrtr force-pushed the lease_renew_refactor_20220607 branch 2 times, most recently from eb441c1 to da0353a on June 10, 2022 08:11
ahrtr marked this pull request as ready for review on June 10, 2022 08:14
ahrtr force-pushed the lease_renew_refactor_20220607 branch 4 times, most recently from 62e541c to 5bc4390 on June 14, 2022 11:29
ahrtr added this to the etcd-v3.6 milestone on Jun 15, 2022
stale bot commented Sep 20, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

ptabor self-assigned this on Oct 5, 2022
Executed:
1. ./scripts/genproto.sh
2. ./scripts/update_proto_annotations.sh

Signed-off-by: Benjamin Wang <wachao@vmware.com>
ahrtr force-pushed the lease_renew_refactor_20220607 branch 2 times, most recently from ac7d4b1 to 899a60a on March 7, 2023 07:29
ahrtr marked this pull request as draft on March 7, 2023 07:44
ahrtr force-pushed the lease_renew_refactor_20220607 branch 2 times, most recently from eb63556 to c16bc84 on March 7, 2023 08:24
ahrtr marked this pull request as ready for review on March 7, 2023 08:28
ahrtr force-pushed the lease_renew_refactor_20220607 branch from c16bc84 to ba9d731 on March 7, 2023 08:52
ahrtr marked this pull request as draft on March 7, 2023 09:11
Previously, the renew request could only be processed by the leader.
If a follower received the renew request, it just forwarded the
request to the leader via an internal http channel. This isn't
reliable because the leader may change during the process.

When a leader receives the renew request, the previous implementation
follows a three-stage workflow: pre-raft, raft and post-raft. It's
too complicated and error-prone, and raft is more like just a
network transport channel instead of a consensus mechanism in this
case.

So we process the renew request via raft directly, which greatly
simplifies the code.

Signed-off-by: Benjamin Wang <wachao@vmware.com>
…>= 3.6

Signed-off-by: Benjamin Wang <wachao@vmware.com>
ahrtr force-pushed the lease_renew_refactor_20220607 branch from ba9d731 to bafc656 on March 7, 2023 10:38
@@ -207,6 +208,11 @@ func (a *applierV3backend) LeaseRevoke(lc *pb.LeaseRevokeRequest) (*pb.LeaseRevo
return &pb.LeaseRevokeResponse{Header: a.newHeader()}, err
}

func (a *applierV3backend) LeaseRenew(lc *pb.LeaseKeepAliveRequest) (*pb.LeaseKeepAliveResponse, error) {
mitake (Contributor) commented May 10, 2023

I think it's still possible to apply LeaseRenew entries issued by a stale leader unless there is a mechanism like comparing terms (#15247 (comment))?
I think all raft messages issued by etcd itself might potentially have similar problems.

mitake (Contributor) commented

I think a simpler approach is to not use MsgProp and to issue lease-related requests as MsgApp (related discussion: #15944 (comment)). With this approach we might be able to solve the issue without changing the WAL format. Its implementation will be tricky, though. I'll try this idea this weekend if I have time.
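For reference, a hedged Go sketch of the term-comparison guard mentioned above: stamp each proposed renew with the proposer's term, and drop it at apply time if a newer term has since been committed. All names here are hypothetical; this only illustrates the idea, not etcd's actual WAL or apply code.

```go
package main

import "fmt"

type renewEntry struct {
	LeaseID      int64
	ProposerTerm uint64 // term observed by the node that issued the proposal
}

// applyRenew ignores renew entries proposed under an older term, so a renew
// issued by a since-deposed leader cannot extend a lease it no longer owns.
func applyRenew(e renewEntry, commitTerm uint64) bool {
	if e.ProposerTerm < commitTerm {
		return false // stale-leader proposal: skip it
	}
	// ... extend the lease here ...
	return true
}

func main() {
	fmt.Println(applyRenew(renewEntry{LeaseID: 1, ProposerTerm: 5}, 5)) // applied
	fmt.Println(applyRenew(renewEntry{LeaseID: 1, ProposerTerm: 4}, 5)) // dropped
}
```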

k8s-ci-robot commented

@ahrtr: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name                  Commit   Required  Rerun command
pull-etcd-verify           bafc656  true      /test pull-etcd-verify
pull-etcd-unit-test-amd64  bafc656  true      /test pull-etcd-unit-test-amd64


k8s-ci-robot commented

PR needs rebase.

