Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mvcc: restore tombstone index if it's first revision #19188

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

fuweid
Copy link
Member

@fuweid fuweid commented Jan 14, 2025

The tombstone could be the only one available revision in database. It happens when all historical revisions have been deleted in previous compactions. Since tombstone revision is still in database, we should restore it as valid key index. Otherwise, we lost that deletion event if we try to compact on that revision.

REF: #19179

Please read https://github.com/etcd-io/etcd/blob/main/CONTRIBUTING.md#contribution-flow.

@k8s-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: fuweid
Once this PR has been reviewed and has the lgtm label, please assign serathius for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copy link

codecov bot commented Jan 14, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 68.76%. Comparing base (ebb2b06) to head (ab83df5).
Report is 26 commits behind head on main.

Additional details and impacted files
Files with missing lines Coverage Δ
server/storage/mvcc/key_index.go 65.51% <100.00%> (+0.81%) ⬆️
server/storage/mvcc/kvstore.go 89.86% <100.00%> (+0.10%) ⬆️

... and 29 files with indirect coverage changes

@@            Coverage Diff             @@
##             main   #19188      +/-   ##
==========================================
- Coverage   68.83%   68.76%   -0.07%     
==========================================
  Files         420      420              
  Lines       35641    35642       +1     
==========================================
- Hits        24532    24511      -21     
- Misses       9687     9706      +19     
- Partials     1422     1425       +3     

Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ebb2b06...ab83df5. Read the comment docs.

tests/e2e/watch_test.go Outdated Show resolved Hide resolved
tests/e2e/watch_test.go Outdated Show resolved Hide resolved
@serathius
Copy link
Member

How much this differs from #18089 ?

@ahrtr
Copy link
Member

ahrtr commented Jan 14, 2025

How much this differs from #18089 ?

As mentioned in #19179 (comment), it's an edge case that we missed when fixing #18089. etcdserver shouldn't silently drop any deletion event, it should either deliver the event to client or respond with an ErrCompact error.

But since we missed this edge case (etcdserver crash right after finishing the compaction but before setting the finishedCompactRev), the etcdserver might silently drops deletion events.

@fuweid
Copy link
Member Author

fuweid commented Jan 14, 2025

How much this differs from #18089 ?

I think it's part of #18089 which I miss in the fix patch. 😂
Just wondering that why robustness test just catches it after two months 🤔 The recent changes look irrelative to it.

@fuweid
Copy link
Member Author

fuweid commented Jan 14, 2025

/retest

@serathius
Copy link
Member

Just wondering that why robustness test just catches it after two months 🤔 The recent changes look irrelative to it.

Don't know :(
Looks like it was causing flakes in presubmits so we never looked at them during robustness meeting because they are assumed to fail due to issues with PRs https://testgrid.k8s.io/sig-etcd-robustness#pull-etcd-robustness-arm64&include-filter-by-regex=Issue17780 . I randomly looked at one PR that had this failure. Filled #19189 to discuss and take lessons from this.

server/storage/mvcc/key_index.go Outdated Show resolved Hide resolved
server/storage/mvcc/kvstore.go Show resolved Hide resolved
@ahrtr
Copy link
Member

ahrtr commented Jan 16, 2025

Overall looks good, with only two minor comments. Thanks @fuweid

The tombstone could be the only one available revision in database.
It happens when all historical revisions have been deleted in previous
compactions. Since tombstone revision is still in database, we should
restore it as valid key index. Otherwise, we lost that event.

Signed-off-by: Wei Fu <fuweid89@gmail.com>
@ahrtr
Copy link
Member

ahrtr commented Jan 17, 2025

/retest

@k8s-ci-robot
Copy link

k8s-ci-robot commented Jan 17, 2025

@fuweid: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-etcd-e2e-arm64 ab83df5 link true /test pull-etcd-e2e-arm64

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

require.NoError(t, err)
defer clus.Close()

c1 := newClient(t, clus.EndpointsGRPC(), cfg.Client)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why new client?

case watchResp := <-watchChan:
require.Len(t, watchResp.Events, 1)

require.Equal(t, mvccpb.DELETE, watchResp.Events[0].Type)
Copy link
Member

@serathius serathius Jan 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was a little surprised that we are able to restore tombstone without any information about key creation, but apparently fields like CreateRevision, Version are set to 0 on delete. Still I think it would be nice if it was obvious in the test.

Instead of comparing subset of fields to some hardcoded values, can we compare whole event to matching event from before the compaction?

   // new lines
	var deleteEvent *clientv3.Event
	select {
	case watchResp := <-c1.Watch(ctx, firstKey, clientv3.WithRev(deleteResp.Header.Revision)):
		require.Len(t, watchResp.Events, 1)
		deleteEvent = watchResp.Events[0]
	case <-time.After(100 * time.Millisecond):
		t.Fatal("timed out getting watch response")
	}
   // old lines
	require.NoError(t, clus.Procs[0].Failpoints().SetupHTTP(ctx, "compactBeforeSetFinishedCompact", `panic`))

	t.Logf("COMPACT rev=%d", deleteResp.Header.Revision)
	_, err = c1.KV.Compact(ctx, deleteResp.Header.Revision, clientv3.WithCompactPhysical())
	require.Error(t, err)

	require.NoError(t, clus.Restart(ctx))

	c2 := newClient(t, clus.EndpointsGRPC(), cfg.Client)
	defer c2.Close()

	watchChan := c2.Watch(ctx, firstKey, clientv3.WithRev(deleteResp.Header.Revision))
	select {
	case watchResp := <-watchChan:
		require.Len(t, watchResp.Events, 1)
		// changed lines
		require.Equal(t, deleteEvent, watchResp.Events[0])
	case <-time.After(100 * time.Millisecond):
		// we care only about the first response, but have an
		// escape hatch in case the watch response is delayed.
		t.Fatal("timed out getting watch response")
	}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

4 participants