Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tests/integration: deflake Corruption cases #14824

Merged

Conversation

fuweid
Copy link
Member

@fuweid fuweid commented Nov 22, 2022

If the corrupted member has been elected as leader, the memberID in alert response won't be the corrupted one. It will be a smaller follower ID since the raftCluster.Members always sorts by ID. We should check the leader ID and decide to use which memberID.

Fixes: #14823

Signed-off-by: Wei Fu fuweid89@gmail.com

Please read https://github.com/etcd-io/etcd/blob/main/CONTRIBUTING.md#contribution-flow.

If the corrupted member has been elected as leader, the memberID in alert
response won't be the corrupted one. It will be a smaller follower ID since
the raftCluster.Members always sorts by ID. We should check the leader
ID and decide to use which memberID.

Fixes: etcd-io#14823

Signed-off-by: Wei Fu <fuweid89@gmail.com>
Copy link
Member

@serathius serathius left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great find!

@serathius
Copy link
Member

This flake shows that there is a issue with meaning of MemberID field. For corruption alarm it depend on which member is corrupted:

  • If follower is corrupted, leader will set MemberID to corrupted follower.
  • If leader is corrupted, leader will set MemberID to follower with lower ID (assuming that both followers are up during check).

Might be worth creating an issue to fix it, however I don't think this can be easily fixed.
cc @ahrtr @ptabor

At least I don't a solution now. First thoughts: validating that which member is corrupted requires separate quorum of members. In 3 node clusters you need all 3 members up (2 members for quorum + 1 corrupted member), in 5 node clusters you need (3 members for quorum and 1 corrupted member).

One possible fix would be reporting MemberId equal 0 and delaying setting true MemberId until enough members are up.

@serathius serathius merged commit a87c993 into etcd-io:main Nov 22, 2022
@ahrtr
Copy link
Member

ahrtr commented Nov 22, 2022

Thanks @fuweid for fixing the flaky test case, but it should be a product/etcd issue instead of a test case issue, so we should fix product code. At least, we should leave a TODO comment in the modified test case.

Actually I raised this comment when I was reviewing the code more than 4 months ago, #14120 (comment). We should depend on quorum to identify the corrupted member, let me deliver a PR for this.

ahrtr added a commit to ahrtr/etcd that referenced this pull request Nov 23, 2022
The change did in etcd-io#14824 fixed
the test instead of the product code. It isn't correct. After we
fixed the product code in this PR, we can revert the change in
that PR.

Signed-off-by: Benjamin Wang <wachao@vmware.com>
@fuweid fuweid deleted the deflake-TestCompactHashCheckDetectCorruption branch November 23, 2022 06:16
@fuweid
Copy link
Member Author

fuweid commented Nov 23, 2022

@ahrtr Looking forward to your pr!

ahrtr added a commit to ahrtr/etcd that referenced this pull request Nov 25, 2022
The change did in etcd-io#14824 fixed
the test instead of the product code. It isn't correct. After we
fixed the product code in this PR, we can revert the change in
that PR.

Signed-off-by: Benjamin Wang <wachao@vmware.com>
ahrtr added a commit to ahrtr/etcd that referenced this pull request Nov 26, 2022
The change did in etcd-io#14824 fixed
the test instead of the product code. It isn't correct. After we
fixed the product code in this PR, we can revert the change in
that PR.

Signed-off-by: Benjamin Wang <wachao@vmware.com>
ahrtr added a commit to ahrtr/etcd that referenced this pull request Nov 26, 2022
The change did in etcd-io#14824 fixed
the test instead of the product code. It isn't correct. After we
fixed the product code in this PR, we can revert the change in
that PR.

Signed-off-by: Benjamin Wang <wachao@vmware.com>
serathius pushed a commit to serathius/etcd that referenced this pull request Dec 2, 2022
The change did in etcd-io#14824 fixed
the test instead of the product code. It isn't correct. After we
fixed the product code in this PR, we can revert the change in
that PR.

Signed-off-by: Benjamin Wang <wachao@vmware.com>
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

[tests/integration] flake test case TestCompactHashCheckDetectCorruption
3 participants