✨ implement failure domain picking #2126

Closed
wants to merge 1 commit

Conversation

chuckha
Contributor

@chuckha chuckha commented Jan 22, 2020

Signed-off-by: Chuck Ha chuckh@vmware.com

What this PR does / why we need it:
This PR implements failure domains in the kubeadm control plane controller.

It guarantees that control plane nodes are spread evenly across available failure domains.

This also adds failure domain support to CAPD (only the fields; you cannot have multiple real failure domains on a local machine) and extends the existing e2e test to ensure that machines are evenly distributed across the failure domains.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Related to #1756

/assign @detiber
/cc @dlipovetsky
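
For illustration only, here is a minimal self-contained sketch of the even-spreading idea described above: on scale-up, pick the failure domain that currently has the fewest control plane machines. It uses plain maps and slices in place of the real clusterv1.FailureDomains and MachineList types and is an assumption about the approach, not the PR's actual code.

```go
package main

import "fmt"

// pickFewest returns the known failure domain that currently has the fewest
// machines assigned to it, which spreads new machines evenly over time.
// Domains and machine assignments are modeled with plain maps/slices here
// rather than the real clusterv1 types.
func pickFewest(domains map[string]struct{}, machineDomains []string) string {
	if len(domains) == 0 {
		return ""
	}
	// Count existing machines per known domain.
	counts := map[string]int{}
	for d := range domains {
		counts[d] = 0
	}
	for _, d := range machineDomains {
		if _, known := counts[d]; known {
			counts[d]++
		}
	}
	// Choose the domain with the smallest count.
	best, bestCount := "", -1
	for d, c := range counts {
		if bestCount == -1 || c < bestCount {
			best, bestCount = d, c
		}
	}
	return best
}

func main() {
	domains := map[string]struct{}{"us-east-1a": {}, "us-east-1b": {}, "us-east-1c": {}}
	existing := []string{"us-east-1a", "us-east-1a", "us-east-1b"}
	fmt.Println(pickFewest(domains, existing)) // prints us-east-1c
}
```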

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 22, 2020
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: chuckha

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 22, 2020
machineList, err := r.getMachines(ctx, types.NamespacedName{Namespace: cluster.GetNamespace(), Name: cluster.GetName()})
if err != nil {
	r.Log.Info("Failure domain is not being used due to a kubernetes client error", "error", err)
	return ""
Member

I'm wondering if it might be better to not swallow this error and bubble it up, so that we don't make poor scaling decisions just because we could not list machines.

As an aside, do we need to take into account the case of no machines here and swallow a "NotFound" error?

Contributor Author

Hmm, I don't quite follow. I'm approaching this from the perspective that if we can't pick a failure domain, then not setting one is fine. Should this be a strict requirement, so that we fail if we can't pick a failure domain? I'm not sure what you mean by making a poor scaling decision.
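
For illustration, a minimal sketch of what bubbling the error up might look like, building on the hypothetical pickFewest helper from the earlier sketch. listMachines is a made-up stand-in for the r.getMachines call in the snippet above; none of this is the PR's actual code.

```go
// listMachines is a hypothetical stand-in for the controller's r.getMachines
// call; it returns the failure domain of each existing machine, or an error.
func listMachines() ([]string, error) {
	return nil, fmt.Errorf("connection refused")
}

// pickFailureDomain returns the listing error to the caller instead of logging
// it and falling back to "", so the reconciler can requeue rather than place a
// machine based on an incomplete view of the cluster.
func pickFailureDomain(domains map[string]struct{}) (string, error) {
	machineDomains, err := listMachines()
	if err != nil {
		return "", fmt.Errorf("failed to list machines while picking a failure domain: %w", err)
	}
	return pickFewest(domains, machineDomains), nil
}
```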

Signed-off-by: Chuck Ha <chuckh@vmware.com>
}

// PickMost returns the failure domain with the most machines.
func (f *FailureDomainPicker) PickMost(failureDomains clusterv1.FailureDomains, machineList *clusterv1.MachineList) string {
Member

Would it make sense to change this to return (string, bool) instead? That way we could check the value of bool to determine if there are no failure domains to choose from rather than using an empty string here.

Contributor Author

Six of one, half a dozen of the other imo, unless you have a use case for needing to know that there was no failure domain at all?

I'd be way more inclined to move that logic out of this piece and into the reconciler. We could just completely skip trying to figure out failure domains if there are no failure domains to figure out.
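
A tiny sketch of the guard being proposed here, again building on the earlier pickFewest sketch; the name nextFailureDomain is an assumption for illustration only.

```go
// nextFailureDomain skips selection entirely when the cluster reports no
// failure domains, leaving the machine's failure domain unset in that case.
func nextFailureDomain(domains map[string]struct{}, machineDomains []string) string {
	if len(domains) == 0 {
		return "" // nothing to pick from; the reconciler can skip this step
	}
	return pickFewest(domains, machineDomains)
}
```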

Member

I think my biggest concern is differentiating the current use of "" to indicate a problem deducing the correct failure domain from the case where no failure domains are defined at all.

A tangential concern I'm starting to have is how we handle a provider adding failure domains to an existing Cluster. In that case we'd currently have a mix of Machines with no specified failure domain, while any new Machines would get one.

One potential solution would be to make it the responsibility of the Machine controller to update Spec.FailureDomain when it is not currently set and the status indicates that the machine has a failure domain.

Contributor Author

Sure, sounds like we have some fleshing out of this problem to do in general. I'll close the PR and file an issue where I outline a more holistic solution.
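
For completeness, a hypothetical counterpart to the earlier pickFewest sketch showing PickMost-style selection over the same simplified types, e.g. as a way to choose which domain to remove a machine from when scaling down; this is an illustration, not the PR's actual implementation.

```go
// pickMost returns the known failure domain with the most machines, e.g. as a
// candidate to remove from when scaling down.
func pickMost(domains map[string]struct{}, machineDomains []string) string {
	if len(domains) == 0 {
		return ""
	}
	// Count existing machines per known domain.
	counts := map[string]int{}
	for d := range domains {
		counts[d] = 0
	}
	for _, d := range machineDomains {
		if _, known := counts[d]; known {
			counts[d]++
		}
	}
	// Choose the domain with the largest count.
	best, bestCount := "", -1
	for d, c := range counts {
		if c > bestCount {
			best, bestCount = d, c
		}
	}
	return best
}
```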

}

// PickFewest returns the failure domain with the fewest machines.
func (f *FailureDomainPicker) PickFewest(failureDomains clusterv1.FailureDomains, machineList *clusterv1.MachineList) string {
Member

Would it make sense to change this to return (string, bool) instead? That way we could check the value of bool to determine if there are no failure domains to choose from rather than using an empty string here.
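
For illustration, the (string, bool) shape suggested in both of these threads could look roughly like this, reusing the hypothetical pickFewest helper from the earlier sketch; the name pickFewestOK is an assumption.

```go
// pickFewestOK returns false when no failure domains are defined, so callers
// can tell "nothing to choose from" apart from an ordinary empty result.
func pickFewestOK(domains map[string]struct{}, machineDomains []string) (string, bool) {
	if len(domains) == 0 {
		return "", false
	}
	return pickFewest(domains, machineDomains), true
}
```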

@k8s-ci-robot
Contributor

@chuckha: The following tests failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| pull-cluster-api-make | 5e493ad | link | /test pull-cluster-api-make |
| pull-cluster-api-integration | 5e493ad | link | /test pull-cluster-api-integration |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@chuckha
Contributor Author

chuckha commented Jan 22, 2020

/close

There is potential to re-open this if we decide this is the only scenario we want to tackle for v1a3

@k8s-ci-robot
Contributor

@chuckha: Closed this PR.

In response to this:

/close

There is potential to re-open this if we decide this is the only scenario we want to tackle for v1a3

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
