Improve membership reconfiguration test coverage #9150

Closed
5 of 8 tasks
gyuho opened this issue Jan 16, 2018 · 4 comments

Comments

@gyuho
Contributor

gyuho commented Jan 16, 2018

Membership reconfiguration is critical to operating an etcd cluster, and etcd-operator depends on it heavily. Although we already have a fair amount of tests around the cluster APIs (member add/remove), we do not test every possible configuration. I doubt the current member APIs have any serious bugs, but it is always good to have edge cases covered and to proactively prevent new bugs in future development.

Some of the missing test scenarios:

@gyuho gyuho changed the title Improve member reconfiguration test coverage Improve membership reconfiguration test coverage Jan 16, 2018
@gyuho
Contributor Author

gyuho commented Jan 17, 2018

@jpbetz You might be interested in improving https://github.com/coreos/etcd/blob/master/integration/cluster_test.go.

We DO have scaling up/down membership tests, but they still use the v2 API.

A good first step would be to switch to the v3 API and make sure the current test suites still pass.
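
For illustration, here is a minimal sketch (not the existing integration test) of what a scale-up/scale-down check against the v3 API could look like with clientv3; the endpoint and peer URL are placeholders, and the import path matches the coreos/etcd tree referenced above (newer releases use go.etcd.io/etcd/clientv3):

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	// Connect to a running cluster (placeholder endpoint).
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Scale up: add a member by its peer URL (placeholder URL).
	addResp, err := cli.MemberAdd(ctx, []string{"http://127.0.0.1:32380"})
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("added member %x, cluster now has %d members",
		addResp.Member.ID, len(addResp.Members))

	// Scale down: remove the member that was just added.
	if _, err := cli.MemberRemove(ctx, addResp.Member.ID); err != nil {
		log.Fatal(err)
	}
	log.Printf("removed member %x", addResp.Member.ID)
}
```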

@jpbetz
Contributor

jpbetz commented Jan 17, 2018

This looks like the right place to start. Thanks @gyuho.

We had a discussion about a potential 1->2 scale-up issue that is purely theoretical, but that I'd like to review: if a newly added member can be briefly unavailable right after a cluster is scaled up to include it--maybe due to replication of a large dataset--then the 1->2 scale-up case would be problematic, because for that particular size change the newly added member instantly becomes an essential member of the cluster, and if it is unavailable, the cluster is unavailable. In theory this applies only to 1->2 scale-ups, because for 2->3, 3->4, ..., even if the newly added member is briefly unavailable, the cluster can still make progress on the Raft log as long as the network and all other nodes remain healthy.

@fanminshi
Member

@jpbetz I think your thought is on the right track. The idea behind Raft is that the leader needs to make sure an entry is committed/agreed on by a majority (n/2 + 1, with integer division) of the nodes. As long as a majority of the nodes are up and running, etcd is operational. So in the 1 -> 2 case, the majority of a 2-node cluster is 2 nodes (2/2 + 1 = 2), meaning the cluster requires both nodes to be "running" in order to be operational. However, going from 2 -> 3, the majority is 2 (3/2 + 1 = 2), so the cluster remains operational regardless of whether the third member has joined, since there are already 2 nodes "running". Hence, the new third member can be briefly unavailable without making the etcd cluster unavailable. The same holds for 3->4, and so on.

When I say "running", I mean that the etcd nodes can talk to each other and are able to elect a leader.
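
To make the quorum arithmetic above concrete, here is a small, self-contained Go illustration (not etcd code) of quorum(n) = n/2 + 1 with integer division. It shows that size 2 is the one case where the cluster cannot tolerate the newly added member being unavailable:

```go
package main

import "fmt"

// quorum returns how many members must be up for a cluster of size n
// to commit entries: n/2 + 1 with integer division.
func quorum(n int) int { return n/2 + 1 }

func main() {
	for n := 2; n <= 5; n++ {
		// Can the n-1 pre-existing members still reach quorum
		// while the newest member is unavailable?
		toleratesNewMemberDown := n-1 >= quorum(n)
		fmt.Printf("size=%d quorum=%d tolerates-unavailable-new-member=%v\n",
			n, quorum(n), toleratesNewMemberDown)
	}
	// Output:
	// size=2 quorum=2 tolerates-unavailable-new-member=false
	// size=3 quorum=2 tolerates-unavailable-new-member=true
	// size=4 quorum=3 tolerates-unavailable-new-member=true
	// size=5 quorum=3 tolerates-unavailable-new-member=true
}
```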

@gyuho
Contributor Author

gyuho commented Sep 17, 2018

Will be addressed with the learner feature.
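
For reference, the learner flow that later shipped in etcd 3.4 exposes MemberAddAsLearner and MemberPromote in clientv3. The sketch below is illustrative only, with placeholder endpoints and peer URLs; the new member joins as a non-voting learner first, so it does not change quorum until it has caught up and is promoted:

```go
package main

import (
	"context"
	"log"
	"time"

	"go.etcd.io/etcd/clientv3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://127.0.0.1:2379"}, // placeholder
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Add the new member as a non-voting learner (placeholder peer URL).
	addResp, err := cli.MemberAddAsLearner(ctx, []string{"http://127.0.0.1:32380"})
	if err != nil {
		log.Fatal(err)
	}

	// ... start the new etcd process and wait for it to catch up ...

	// Promote the learner to a voting member; only now does quorum change.
	if _, err := cli.MemberPromote(ctx, addResp.Member.ID); err != nil {
		log.Fatal(err)
	}
	log.Printf("member %x promoted to voting member", addResp.Member.ID)
}
```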

@gyuho gyuho closed this as completed Sep 17, 2018