Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Load all leases from backend #13676

Merged
merged 2 commits into from
Feb 14, 2022
Merged

Load all leases from backend #13676

merged 2 commits into from
Feb 14, 2022

Conversation

chaochn47
Copy link
Member

Fix #12535

Reproduce integration test is on chaochn47#6
Manual e2e test is in #12535 (comment)

@serathius PTAL, thanks!

@codecov-commenter
Copy link

codecov-commenter commented Feb 9, 2022

Codecov Report

Merging #13676 (1675df3) into main (bba3937) will decrease coverage by 0.06%.
The diff coverage is 36.36%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main   #13676      +/-   ##
==========================================
- Coverage   72.77%   72.70%   -0.07%     
==========================================
  Files         465      465              
  Lines       37865    37917      +52     
==========================================
+ Hits        27556    27569      +13     
- Misses       8535     8562      +27     
- Partials     1774     1786      +12     
Flag Coverage Δ
all 72.70% <36.36%> (-0.07%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
server/storage/schema/lease.go 69.23% <36.36%> (-12.59%) ⬇️
client/v3/txn.go 73.33% <0.00%> (-26.67%) ⬇️
server/proxy/grpcproxy/register.go 69.76% <0.00%> (-9.31%) ⬇️
server/auth/simple_token.go 84.33% <0.00%> (-5.43%) ⬇️
raft/rafttest/node.go 95.00% <0.00%> (-5.00%) ⬇️
server/storage/mvcc/watchable_store.go 85.86% <0.00%> (-3.27%) ⬇️
server/etcdserver/api/v3rpc/interceptor.go 74.47% <0.00%> (-3.13%) ⬇️
server/etcdserver/api/v3election/election.go 66.66% <0.00%> (-2.78%) ⬇️
server/etcdserver/api/v3rpc/watch.go 85.90% <0.00%> (-2.02%) ⬇️
server/etcdserver/api/rafthttp/msgappv2_codec.go 71.30% <0.00%> (-1.74%) ⬇️
... and 17 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update bba3937...1675df3. Read the comment docs.

@chaochn47 chaochn47 changed the title reject lease grant with negative lease ID Reject lease API with negative lease ID Feb 10, 2022
@chaochn47
Copy link
Member Author

chaochn47 commented Feb 10, 2022

The test failures are not related to current PR.

Fail at test_e2e TestSerializableReadWithoutQuorum during the Coverage/coverage test while passing the same test in E2E/test. This test case was added in the latest merged PR #13525.

--- FAIL: TestSerializableReadWithoutQuorum (10.15s)
    logger.go:130: 2022-02-10T07:32:10.353Z	INFO	starting server...	{"name": "test-0"}
    logger.go:130: 2022-02-10T07:32:10.353Z	INFO	spawning process	{"args": ["--name", "test-0", "--listen-client-urls", "http://localhost:20000", "--advertise-client-urls", "http://localhost:20000", "--listen-peer-urls", "https://localhost:20001", "--initial-advertise-peer-urls", "https://localhost:20001", "--initial-cluster-token", "new", "--data-dir", "/tmp/TestSerializableReadWithoutQuorum846452218/001", "--snapshot-count", "100000", "--peer-auto-tls", "--initial-cluster", "test-0=https://localhost:20001,test-1=https://localhost:20006,test-2=https://localhost:20011", "-test.coverprofile=e2e.1644478330353815367.coverprofile", "-test.outputdir=/home/runner/work/etcd/etcd/covdir/e2e"], "working-dir": "/home/runner/work/etcd/etcd/tests/e2e"}
    logger.go:130: 2022-02-10T07:32:15.416Z	INFO	started server.	{"name": "test-0"}
    ctl_v3_test.go:273: ---testFunc logic DONE
    ctl_v3_test.go:286: closing test cluster...
    logger.go:130: 2022-02-10T07:32:15.455Z	INFO	closing test cluster...
    logger.go:130: 2022-02-10T07:32:15.455Z	INFO	stoping server...	{"name": "test-0"}
    logger.go:130: 2022-02-10T07:32:15.483Z	INFO	stopped server.	{"name": "test-0"}
    logger.go:130: 2022-02-10T07:32:15.483Z	INFO	closing server...	{"name": "test-0"}
    logger.go:130: 2022-02-10T07:32:15.483Z	INFO	stoping server...	{"name": "test-0"}
    logger.go:130: 2022-02-10T07:32:15.483Z	INFO	removing directory	{"data-dir": "/tmp/TestSerializableReadWithoutQuorum846452218/001"}
    logger.go:130: 2022-02-10T07:32:15.483Z	INFO	closed test cluster.
    ctl_v3_test.go:289: closed test cluster...
    ctl_v3_kv_no_quorum_test.go:69: linearizableReadTest is expected to fail, but it succeeded
    ctl_v3_test.go:273: ---testFunc logic DONE
    ctl_v3_test.go:286: closing test cluster...
    logger.go:130: 2022-02-10T07:32:20.503Z	INFO	closing test cluster...
    logger.go:130: 2022-02-10T07:32:20.503Z	INFO	stoping server...	{"name": "test-0"}
    logger.go:130: 2022-02-10T07:32:20.503Z	INFO	closing server...	{"name": "test-0"}
    logger.go:130: 2022-02-10T07:32:20.503Z	INFO	stoping server...	{"name": "test-0"}
    logger.go:130: 2022-02-10T07:32:20.503Z	INFO	removing directory	{"data-dir": "/tmp/TestSerializableReadWithoutQuorum846452218/001"}
    logger.go:130: 2022-02-10T07:32:20.503Z	INFO	closed test cluster.
    ctl_v3_test.go:289: closed test cluster...

Fail grpcproxy-integration-test TestAuthority during the grpcProxy-tests/test test while passing the same test in integration-test. Looks like this specific test is duplicated cause it doesn't explicitly exclude the grpc-proxy build tag.

--- FAIL: TestAuthority (2.45s)
    --- FAIL: TestAuthority/Size:_3,_Scenario:_"http://address[:port]" (0.00s)
        logger.go:130: 2022-02-10T07:19:21.859Z	INFO	m-1	LISTEN GRPC	{"member": "m-1", "grpcAddr": "127.0.0.1:33040", "m.Name": "m-1"}
        logger.go:130: 2022-02-10T07:19:21.860Z	INFO	m0	LISTEN GRPC	{"member": "m0", "grpcAddr": "127.0.0.1:33071", "m.Name": "m0"}
        cluster.go:693: listen failed on grpc socket 127.0.0.1:33071 (listen tcp 127.0.0.1:33071: bind: address already in use)

@serathius
Copy link
Member

serathius commented Feb 10, 2022

I'm still not sure what exactly this will fix. Let's try to describe the current behavior:

  • LeaseGrant accepts 3 ranges of lease ID:
    • Positive - This is a correct ID that will work correctly.
    • Zero - This will prompt Server to generate it for user, should result in positive ID,
      r.ID = int64(s.reqIDGen.Next() & ((1 << 63) - 1))
    • Negative - Value allowed in API, however possibly not expected and correctly handled.
  • initAndRecover loads only positive leases to memory. So if etcd is restarted lease with negative ID would not be loaded, resulting in lessor resulting returning "lease not found" for it.

    etcd/lease/lessor.go

    Lines 311 to 315 in a905430

    l := le.leaseMap[id]
    if l == nil {
    le.mu.Unlock()
    return ErrLeaseNotFound
    }

Going to the issues reported in #12535

  • Etcd state gets corrupted and logs logs like failed to apply request "header:<ID:9133152155410395282 > lease_revoke:<id:-6f4bd77510edab61>" with response "size:27" took (69.202µs) to execute, err is lease not found, that share commonality of reporting negative lease ID.

@serathius
Copy link
Member

This leads me to conclude that main problem is related to API allowing negative lease ID, however initAndRecover not handling it correctly.

However we cannot just disable negative leaseID as this would be a breaking change in the API. I think we should rather consider fixing initAndRecover instead.

@chaochn47 chaochn47 changed the title Reject lease API with negative lease ID Load all leases from backend Feb 11, 2022
Copy link
Member

@serathius serathius left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chaochn47
Copy link
Member Author

Addressed all comments, squashed all commits into one and added one line in CHANGELOG-3.6.md. PTAL, thanks!!

I think we should backport this bug fix to release-3.3, release-3.4 and release-3.5 branches. WDYT? @serathius @spzala

@spzala
Copy link
Member

spzala commented Feb 14, 2022

Addressed all comments, squashed all commits into one and added one line in CHANGELOG-3.6.md. PTAL, thanks!!

I think we should backport this bug fix to release-3.3, release-3.4 and release-3.5 branches. WDYT? @serathius @spzala

@chaochn47 thank you for addressing comments and squash, it looks great. +1 to backporting.

@serathius serathius merged commit 09e35a4 into etcd-io:main Feb 14, 2022
@chaochn47 chaochn47 deleted the fix_12535 branch February 14, 2022 22:12
chaochn47 added a commit to chaochn47/etcd that referenced this pull request Feb 22, 2022
chaochn47 added a commit to chaochn47/etcd that referenced this pull request Feb 22, 2022
chaochn47 added a commit to chaochn47/etcd that referenced this pull request Feb 22, 2022
chaochn47 added a commit to chaochn47/etcd that referenced this pull request Feb 22, 2022
serathius added a commit that referenced this pull request Mar 8, 2022
backport 3.5: #13676 load all leases from backend
chaochn47 added a commit to chaochn47/etcd that referenced this pull request Jul 19, 2022
chaochn47 added a commit to chaochn47/etcd that referenced this pull request Jul 19, 2022
chaochn47 added a commit to chaochn47/etcd that referenced this pull request Jul 19, 2022
chaochn47 added a commit to chaochn47/etcd that referenced this pull request Jul 19, 2022
Signed-off-by: Chao Chen <chaochn@amazon.com>
ahrtr added a commit that referenced this pull request Jul 20, 2022
backport 3.5: #13676 load all leases from backend
tjungblu pushed a commit to tjungblu/etcd that referenced this pull request Sep 7, 2022
tjungblu pushed a commit to tjungblu/etcd that referenced this pull request Sep 8, 2022
Signed-off-by: Chao Chen <chaochn@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

etcd server DB out of sync undetected
4 participants