Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve machine/controller.Reconcile to intelligently block deleting … #350

Merged
merged 1 commit into from
Jun 15, 2018

Conversation

spew
Copy link
Contributor

@spew spew commented Jun 15, 2018

…a node associated with the controller machine. This will enable external machine controllers to delete any machine/node, including masters.

What this PR does / why we need it:
This PR improves the machine/controller package's deletion protection to only apply if the node that the machine controller is located on is the node associated with the machine to be deleted. This makes it possible for master nodes to be deleted by machine controller. This also removes the need for the "InCluster" flag so it has been removed.

The way that the machine controller determines if it is on the same node is by using the downward API to inject the name of the actual node that the pod is running on into a NODE_NAME environment variable. You can see that this environment variable is loaded in the controller.Init function. An edge case is handled where the node's name matches but the UID does not match (imagine that our external cluster has a node name of "minikube" and for whatever reason our node in the cluster has a name of "minikube", we'll still be able to delete because the UIDs won't match.

Special notes for your reviewer:

Release note:

MachineController can now delete nodes associated with master machines. 

@kubernetes/kube-deploy-reviewers

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: spew

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jun 15, 2018
@spew
Copy link
Contributor Author

spew commented Jun 15, 2018

/assign @roberthbailey @k4leung4

@k4leung4
Copy link
Contributor

"sigs.k8s.io/cluster-api/pkg/controller/sharedinformers"
"sigs.k8s.io/cluster-api/util"
)

const (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For a single const you should just put it on one line instead of in a block:

const NodeNameEnvVar = "NODE_NAME"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed.

if c.nodeName == "" {
c.nodeName = os.Getenv(NodeNameEnvVar)
if c.nodeName == "" {
glog.Infof("environment variable %v is not set, this controller will not protect against deleting its own machine", NodeNameEnvVar)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe a warning instead of info (since we generally do expect the env var to be set)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, good call.

if cfg.ControllerConfig.InCluster && util.IsMaster(machine) {
glog.Infof("skipping reconciling master machine object %v", name)
if !c.isDeleteAllowed(machine) {
glog.Infof("skipping reconciling of machine object %v", name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not your code but you should fix it while here: capitalize the first line of log messages (unlike error strings).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed this instance, will follow up in another PR for the other instances.

node, err := c.kubernetesClientSet.CoreV1().Nodes().Get(c.nodeName, metav1.GetOptions{})
if err != nil {
glog.Infof("unable to determine if controller's node is associated with machine '%v', error getting node named '%v': %v", machine.Name, c.nodeName, err)
return true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this return false? I'm imagining a case where temporarily lose connectivity (what if the apiserver pod is being restarted?) or a single http request failing causes the controller to delete itself. If we return false will the reconcile loop try again? Or do we need retry logic around this call?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I debated this one back and forth and finally decided that we could err on the side of deleting in the "api server not available" scenarios. I biased for optimizing the "delete cluster" experience. For example, if for whatever reason you are trying to delete a cluster that has the API server that doesn't work or is crash looping or the pod is deleted then you would be unable to delete the machine controller node if this check returned false.

You are correct -- that the API server can be temporarily unavailable, timing out, etc, can & will happen. However, I chose to not deal with it because the correct answer is retries.

We aren't currently doing retries in this file and before introducing it I think I'd prefer a more holistic approach (i.e. the client should support a retry policy that wraps all calls and makes it so no one needs to write retry code explicitly). The reason why I think we should wait is that this is not yet production code and therefore it is probably better to optimize for code readability and ease of making future changes rather than stability of the system (the system which doesn't even exist yet).

Happy to hear your thoughts. And maybe kubernetes clients already support retries and we just need to turn some stuff on or import it into the cluster API clients.

glog.Infof("unable to determine if controller's node is associated with machine '%v', error getting node named '%v': %v", machine.Name, c.nodeName, err)
return true
}
// don't allow this controller to delete its underlying node
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Comments should be full sentences, starting with a capital letter and ending with a period.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

nodeUID: "another-uid",
},
{
name: "Delete machine, controller on the same node",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like there is at least one test missing. Maybe a specific set of tests for isDeleteAllowed would make sense as that function has a bunch of branches.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I agree, i'll add it.

@spew
Copy link
Contributor Author

spew commented Jun 15, 2018

Updated post review comments.

@@ -25,14 +25,16 @@ import (
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/client-go/kubernetes"

"os"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

standard imports should always be in the first group
https://github.com/golang/go/wiki/CodeReviewComments#imports

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed -- i'll try to figure out how to automate this.

@k4leung4
Copy link
Contributor

lgtm

…a node associated with the controller machine. This will enable external machine controllers to delete any machine/node, including masters.
@roberthbailey
Copy link
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 15, 2018
@k8s-ci-robot k8s-ci-robot merged commit 5997a01 into kubernetes-sigs:master Jun 15, 2018
jayunit100 pushed a commit to jayunit100/cluster-api that referenced this pull request Jan 31, 2020
…a node associated with the controller machine. This will enable external machine controllers to delete any machine/node, including masters. (kubernetes-sigs#350)
jayunit100 pushed a commit to jayunit100/cluster-api that referenced this pull request Jan 31, 2020
…x-yaml-out-dir

Fix the -o OUT_DIR opt for generating yaml
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants