
Leases wait for entries to be applied #13690

Merged
serathius merged 2 commits into etcd-io:main from the lease_renew_linearizable branch on Apr 11, 2022

Conversation

ahrtr
Member

@ahrtr ahrtr commented Feb 12, 2022

Fix issue 13675.

When etcdserver receives a LeaseRenew request, it may still be in the middle of processing a LeaseGrant request on exactly the same leaseID. Accordingly it may return a TTL=0 to the client due to a "lease not found" error. Please see my analysis in issue 13675.

Before I go ahead and add an e2e or integration test, I'd like to get more feedback. cc @serathius @ptabor @spzala

@liberize, please kindly let me know whether this resolves your issue. Please clone my branch, build the etcd binaries and verify the fix, thanks. I verified it in my local environment and confirmed that it works.
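
For readers unfamiliar with the failure mode, here is a toy model of the race in Go. The types and helpers below are purely illustrative; this is a sketch of the idea (wait for pending entries to be applied before consulting the lessor), not the actual etcdserver code or this patch.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errLeaseNotFound = errors.New("lease not found")

// lessor is a drastically simplified stand-in for etcd's lessor.
type lessor struct {
	mu     sync.Mutex
	leases map[int64]int64 // leaseID -> TTL in seconds
}

func (l *lessor) grant(id, ttl int64) {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.leases[id] = ttl
}

func (l *lessor) renew(id int64) (int64, error) {
	l.mu.Lock()
	defer l.mu.Unlock()
	ttl, ok := l.leases[id]
	if !ok {
		// Without waiting for the pending LeaseGrant entry to be applied,
		// this is the error that surfaces to the client as TTL=0.
		return 0, errLeaseNotFound
	}
	return ttl, nil
}

type server struct {
	lessor  *lessor
	applied chan struct{} // closed once pending raft entries have been applied
}

// leaseRenew models the behavior after this change: wait for outstanding
// entries (such as a concurrent LeaseGrant on the same ID) to be applied
// before consulting the lessor.
func (s *server) leaseRenew(id int64) (int64, error) {
	<-s.applied
	return s.lessor.renew(id)
}

func main() {
	s := &server{
		lessor:  &lessor{leases: map[int64]int64{}},
		applied: make(chan struct{}),
	}

	// A LeaseGrant for ID 1 is still in the apply pipeline...
	go func() {
		s.lessor.grant(1, 60)
		close(s.applied)
	}()

	// ...so renewing without the wait could hit "lease not found";
	// with the wait, the renew sees the granted lease.
	ttl, err := s.leaseRenew(1)
	fmt.Println(ttl, err) // 60 <nil>
}
```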

@codecov-commenter

codecov-commenter commented Feb 12, 2022

Codecov Report

Merging #13690 (d5835c9) into main (7c472a9) will decrease coverage by 0.09%.
The diff coverage is 80.00%.

❗ Current head d5835c9 differs from pull request most recent head 5360780. Consider uploading reports for the commit 5360780 to get more accurate results

Impacted file tree graph

@@            Coverage Diff             @@
##             main   #13690      +/-   ##
==========================================
- Coverage   72.68%   72.58%   -0.10%     
==========================================
  Files         467      467              
  Lines       38278    38291      +13     
==========================================
- Hits        27822    27795      -27     
- Misses       8653     8685      +32     
- Partials     1803     1811       +8     
Flag Coverage Δ
all 72.58% <80.00%> (-0.10%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
api/v3rpc/rpctypes/error.go 90.47% <ø> (ø)
server/etcdserver/api/v3rpc/util.go 74.19% <ø> (ø)
server/etcdserver/errors.go 0.00% <ø> (ø)
server/etcdserver/v3_server.go 78.44% <80.00%> (-0.22%) ⬇️
server/proxy/grpcproxy/register.go 69.76% <0.00%> (-9.31%) ⬇️
client/v3/leasing/util.go 91.66% <0.00%> (-6.67%) ⬇️
client/v3/concurrency/mutex.go 61.64% <0.00%> (-5.48%) ⬇️
client/v3/concurrency/session.go 88.63% <0.00%> (-4.55%) ⬇️
client/v3/leasing/cache.go 87.77% <0.00%> (-3.89%) ⬇️
client/pkg/v3/testutil/leak.go 67.25% <0.00%> (-2.66%) ⬇️
... and 17 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7c472a9...5360780. Read the comment docs.

@liberize

Verified. It works.

@ahrtr
Member Author

ahrtr commented Feb 12, 2022

I found a similar issue, 6978, which was fixed by PR 7015 more than 5 years ago.

That fix only covered the HTTP endpoints; etcd didn't support gRPC at that time, and the fix was never ported to gRPC later. So I have now ported the fix to gRPC as well.

Will update the existing test cases TestV3LeaseRenewStress and TestV3LeaseTimeToLiveStress later.

@ahrtr ahrtr force-pushed the lease_renew_linearizable branch 4 times, most recently from 6092356 to 05b4471 on February 13, 2022 at 22:38
@ahrtr
Member Author

ahrtr commented Feb 13, 2022

I updated the integration test and the 3.6 changelog. This PR is ready for review. @serathius @spzala @ptabor

@ahrtr
Member Author

ahrtr commented Feb 14, 2022

Just rebased this PR.

@ahrtr ahrtr force-pushed the lease_renew_linearizable branch 3 times, most recently from 68f90a6 to 314db9c on February 15, 2022 at 23:37
@ahrtr
Member Author

ahrtr commented Feb 16, 2022

The semaphoreci failure is due to the Docker Hub rate limit. See the error below:

Step 1/14 : FROM ubuntu:21.10
toomanyrequests: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit
make[1]: *** [build-docker-test] Error 1
make[1]: Leaving directory `/home/runner/etcd'
make: *** [ensure-docker-test-image-exists] Error 2

Just raised an issue to track this failure.

@ahrtr ahrtr force-pushed the lease_renew_linearizable branch 2 times, most recently from 41cec48 to fb40b4e on February 17, 2022 at 23:25
@ahrtr
Member Author

ahrtr commented Feb 25, 2022

Resolved all comments. PTAL @serathius @ptabor @spzala

@ahrtr
Member Author

ahrtr commented Feb 25, 2022

Just rebased this PR.

Member

@serathius serathius left a comment


LGTM, based on the fact that this restores the behavior of the V2 API (both ApplyWait and the 1 second timeout).

@ahrtr
Member Author

ahrtr commented Mar 3, 2022

cc @spzala @ptabor

@ahrtr ahrtr force-pushed the lease_renew_linearizable branch 2 times, most recently from d1d8a94 to 2493d87 on April 2, 2022 at 06:54
@serathius
Member

I think we need @ptabor's opinion here. In #13868 (comment) he brought up that making LeaseLeases linearizable is a breaking change, so it would be worth discussing it here again.

@serathius
Member

cc @endocrimes

@serathius
Member

serathius commented Apr 4, 2022

Pasting my discussion with @endocrimes about implementing LeaseList.

danielle 5:38 PM
o/ - trying to understand when I’d want to use linearizableReadNotify vs forwarding a request to the leader.
I’m guessing the intention is that linearizableReadNotify is a relatively cheap way to shard reads over etcd servers, but it’s not particularly well documented, so I’m kinda piecing that together from impl - and don’t wanna make too many guesses yet 😅

serathius 6:12 PM
My guess is that we should use linearizableReadNotify. Forwarding the request looks to me like a legacy of the V2 code.

In v2, revoking leases didn't go through raft; the leader just sent a SYNC request through raft that didn't specify which leases should be revoked. This meant that each member could remove different leases depending on the time on that server, so all requests about leases still needed to go through the leader. I assume that when implementing the V3 API this approach was blindly copied. The same problem affects other lease methods, see #13690.

In V3, revoking a lease goes through raft. The LeaseRevoke entry specifies which lease is removed and all members need to agree through quorum. This means it should be enough to use linearizableReadNotify (i.e. check that the member is up to date with the leader's applied index) and then return the local response without contacting the leader.

I'm not super familiar with the lease code and this is just based on quickly reading through it. We should double check with @Piotr Tabor and @spzala. Looking at #13690, I think we should open an issue that describes this incorrect behavior in the V3 API and removes the need to go through the leader.
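
To make the linearizableReadNotify idea above concrete, here is a small self-contained sketch of the pattern being described: ask the leader for its current commit index, wait until the local applied index catches up, then serve the read from local state. The member type and its fields are purely illustrative assumptions, not etcd's actual implementation.

```go
package main

import (
	"context"
	"fmt"
	"sync/atomic"
)

// member is an illustrative stand-in for an etcd member; not real etcd code.
type member struct {
	appliedIndex atomic.Int64
	applyNotify  chan struct{}  // signalled each time an entry is applied
	leaderCommit func() int64   // stand-in for a ReadIndex round trip to the leader
	localLeases  func() []int64 // stand-in for reading lease state from local storage
}

// linearizableRead waits until this member has applied at least the index the
// leader had committed when the request arrived, then answers from local state
// without forwarding the request to the leader.
func (m *member) linearizableRead(ctx context.Context) ([]int64, error) {
	target := m.leaderCommit()
	for m.appliedIndex.Load() < target {
		select {
		case <-m.applyNotify:
		case <-ctx.Done():
			return nil, ctx.Err()
		}
	}
	return m.localLeases(), nil
}

func main() {
	m := &member{
		applyNotify:  make(chan struct{}, 2),
		leaderCommit: func() int64 { return 7 },
		localLeases:  func() []int64 { return []int64{1, 2} },
	}
	m.appliedIndex.Store(5)

	// Simulate the apply loop catching up to the leader's commit index.
	go func() {
		for i := 0; i < 2; i++ {
			m.appliedIndex.Add(1)
			m.applyNotify <- struct{}{}
		}
	}()

	leases, err := m.linearizableRead(context.Background())
	fmt.Println(leases, err) // [1 2] <nil>
}
```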

@ahrtr
Member Author

ahrtr commented Apr 4, 2022

What this PR tries to resolve is definitely a BUG, and I think it's different from what we discussed regarding ListLease.

@serathius serathius mentioned this pull request Apr 6, 2022
@ahrtr
Member Author

ahrtr commented Apr 7, 2022

The following test flake should be resolved by this PR.



        lease_test.go:326: 
            	Error Trace:	lease_test.go:326
            	            				execute.go:29
            	            				asm_amd64.s:1581
            	Error:      	Not equal: 
            	            	expected: -1
            	            	actual  : 19
            	Test:       	TestLeaseGrantRevoke/PeerTLS

@ahrtr
Member Author

ahrtr commented Apr 8, 2022

@ptabor @serathius @spzala Could you take a look at this PR? Once it's merged, I can backport to 3.5.

@serathius
Member

I talked with @ptabor and we agreed that the current lease implementation is bad, and the fact that it is not linearizable is a big problem. This PR addresses some of the issues but not all of them (checking isLeader is not enough to guarantee linearizability). We also think that this is a bug that should be fixed in older releases (definitely v3.5, not sure about v3.4).

I will work on a reproduction and create a proposal to rewrite the API for v3.5.3. @endocrimes this will also cover your work on LeaseLeases and LeaseList.

Member

@serathius serathius left a comment


We will use a different approach to fix the issue.

@ahrtr
Member Author

ahrtr commented Apr 8, 2022

Sounds good to me. Can we move this one out of 3.5.3? I don't think it's a blocker for 3.5.3

When etcdserver receives a LeaseRenew request, it may still be in
the middle of processing a LeaseGrant request on exactly the same
leaseID. Accordingly it may return a TTL=0 to the client due to a
"lease not found" error. So the leader should wait for the pending
entries to be applied before processing such client requests.
Contributor

@ptabor ptabor left a comment


I did another round, and I think this patch is the best we can get in the short term:

  1. It does NOT fix issues when the leader changes, and that's not the goal of this patch.
  2. The waitApplied index seems not to 'get stuck' as I initially thought when the member is disconnected, as ApplyWait() just waits for the current commit index (hard state already received) to be applied. This happens only when all the needed entries are already in the hard state, so the apply process should eventually converge (a rough sketch of this wait follows below).
  3. The solution that changes how 'TTL' is reported does not address the race with 'Grant' that this patch solves.
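
Since point 2 leans on how ApplyWait() behaves, here is a rough, self-contained sketch of that kind of wait: callers block until the apply loop reports an index at least as large as the one they asked for. This is an illustration of the idea only, loosely modeled on a wait-by-index registry, not etcd's pkg/wait code.

```go
package main

import (
	"fmt"
	"sync"
)

// applyWait lets callers block until the apply loop has reached a given index.
type applyWait struct {
	mu      sync.Mutex
	applied uint64
	waiters map[uint64][]chan struct{}
}

func newApplyWait() *applyWait {
	return &applyWait{waiters: map[uint64][]chan struct{}{}}
}

// Wait returns a channel that is closed once Trigger has been called with an
// index >= the requested one. If that already happened, it returns a closed
// channel immediately.
func (w *applyWait) Wait(index uint64) <-chan struct{} {
	w.mu.Lock()
	defer w.mu.Unlock()
	ch := make(chan struct{})
	if w.applied >= index {
		close(ch)
		return ch
	}
	w.waiters[index] = append(w.waiters[index], ch)
	return ch
}

// Trigger records progress of the apply loop and releases satisfied waiters.
func (w *applyWait) Trigger(applied uint64) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.applied = applied
	for idx, chans := range w.waiters {
		if idx <= applied {
			for _, ch := range chans {
				close(ch)
			}
			delete(w.waiters, idx)
		}
	}
}

func main() {
	w := newApplyWait()
	done := w.Wait(3) // e.g. wait for the commit index observed when the request arrived
	go w.Trigger(3)   // the apply loop eventually reaches that index
	<-done
	fmt.Println("entries applied, safe to serve the lease request")
}
```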

@ahrtr
Member Author

ahrtr commented Apr 11, 2022

Thanks @ptabor for the feedback, which basically makes sense.

Just one comment/question on your point "It does NOT fix issues when the leader changes, and that's not the goal of this patch." Just as I explained in 13915#issuecomment-1094491226, a leader change during LeaseRenew shouldn't be a problem. Could you elaborate on why you think it's a problem?

@serathius serathius changed the title from "Support linearizable renew lease" to "Leases wait for entries to be applied" on Apr 11, 2022
@serathius
Member

Just as I explained in 13915#issuecomment-1094491226, a leader change during LeaseRenew shouldn't be a problem. Could you elaborate on why you think it's a problem?

As explained in #13915, checking isLeader doesn't make sense outside the raft protocol. The leader election timeout of 1 second (the default) means it can take the leader up to 1 second to notice that it has been disconnected from the cluster; during that time the isLeader check will still return true and leases will still be served.

Based on @ptabor's comment I think this change makes sense and addresses one part of the problem I missed: the possibility that there is a LeaseRevoke request in the hard state that is still waiting to be applied. With the need to cut v3.5.3 soon, I think this is the best fix we can prepare for now.

@serathius
Member

serathius commented Apr 11, 2022

FYI the integration-1-cpu workflow is failing for the third time. Possibly not related to this PR, but still very worrying, as the error in the last run was:

logger.go:130: 2022-04-11T10:16:18.540Z	FATAL	m0	Verification failed	{"member": "m0", "data-dir": "/tmp/TestV3WatchRestoreSnapshotUnsync158866305/002/etcd3711313259", "error": "backend.ConsistentIndex (8) must be >= last snapshot index (24)"}

Possibly this is a result of enabling ETCD_VERIFY in tests.

@ahrtr
Member Author

ahrtr commented Apr 11, 2022

FYI the integration-1-cpu workflow is failing for the third time. Possibly not related to this PR, but still very worrying, as the error in the last run was:

logger.go:130: 2022-04-11T10:16:18.540Z	FATAL	m0	Verification failed	{"member": "m0", "data-dir": "/tmp/TestV3WatchRestoreSnapshotUnsync158866305/002/etcd3711313259", "error": "backend.ConsistentIndex (8) must be >= last snapshot index (24)"}

Possibly this is a result of enabling ETCD_VERIFY in tests.

I will take a deep dive into this tomorrow.

@ahrtr
Member Author

ahrtr commented Apr 11, 2022

Yes, the failure should not be related to this PR. Previously all tests were green; I just rebased this PR.

The tests are green again.

@serathius serathius merged commit dd08e15 into etcd-io:main Apr 11, 2022
@serathius
Member

@ahrtr Can you backport the change to release-3.5?

@ahrtr
Member Author

ahrtr commented Apr 11, 2022

@ahrtr Can you backport the change to release-3.5?

Sure, will do today.
