
machine controller should sense the event that infra obj turns into notReady #1622

Closed
xrmzju opened this issue Oct 20, 2019 · 24 comments
Labels
help wanted: Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines.
kind/bug: Categorizes issue or PR as related to a bug.
lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed.
priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.

Comments

@xrmzju (Member) commented Oct 20, 2019

If the infra obj turns from ready to notReady and has not been patched with any ErrorMsg/ErrReason, we only requeue the machine obj and it stays in the Running state.

return errors.Wrapf(&capierrors.RequeueAfterError{RequeueAfter: externalReadyWait},

If the infra machine is recoverable (e.g. a reboot), I don't think the infra controller should set any ErrorMsg/ErrReason on the infra resource.
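To make the scenario concrete, here is a minimal sketch, using a hypothetical FooMachine infra type rather than any real provider, of an infra controller that flips Ready back to false for a recoverable condition without setting ErrorMsg/ErrReason:

```go
// Minimal sketch only. FooMachine is a hypothetical infra type, not a real
// provider: on a recoverable condition (e.g. a reboot) the controller flips
// Ready back to false and deliberately leaves the error fields unset, because
// the condition is expected to clear on its own.
package main

import "fmt"

// FooMachineStatus mirrors the status fields the machine controller inspects
// on the infra object (field names assumed for illustration).
type FooMachineStatus struct {
	Ready        bool
	ErrorReason  *string
	ErrorMessage *string
}

func reconcileFooMachine(status *FooMachineStatus, instanceState string) {
	switch instanceState {
	case "running":
		status.Ready = true
	case "rebooting":
		// Recoverable: report notReady, but do not mark the resource as failed.
		status.Ready = false
	default:
		// Truly terminal states would set ErrorReason/ErrorMessage here.
	}
}

func main() {
	s := &FooMachineStatus{Ready: true}
	reconcileFooMachine(s, "rebooting")
	// The machine controller only sees ready=false; with no error fields set it
	// just requeues and the Machine stays in the Running phase.
	fmt.Printf("ready=%v errorReason=%v errorMessage=%v\n", s.Ready, s.ErrorReason, s.ErrorMessage)
}
```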

/kind bug

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Oct 20, 2019
@ncdc (Contributor) commented Oct 21, 2019

I believe this is handled in a different place. The reconcile flow is:

  1. r.reconcileInfrastructure(ctx, m)
  2. infraConfig, err := r.reconcileExternal(ctx, m, &m.Spec.InfrastructureRef)
  3. The error-propagation block inside reconcileExternal:

     errorReason, errorMessage, err := external.ErrorsFrom(obj)
     if err != nil {
         return nil, err
     }
     if errorReason != "" {
         machineStatusError := capierrors.MachineStatusError(errorReason)
         m.Status.ErrorReason = &machineStatusError
     }
     if errorMessage != "" {
         m.Status.ErrorMessage = pointer.StringPtr(
             fmt.Sprintf("Failure detected from referenced resource %v with name %q: %s",
                 obj.GroupVersionKind(), obj.GetName(), errorMessage),
         )
     }

At this point, the machine's .status.errorReason and/or .status.errorMessage have been populated if the infra resource was in error. This happens regardless of the value of the infra resource's .status.ready field. The line you linked above happens later.

However, as best I can tell, once the machine's .status.errorReason and/or .status.errorMessage are set, the code does not provide any way to clear them. But I believe this is the correct behavior. If a machine is in error, this is a permanent failure condition.
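For reference, the later ready check mentioned above (the line linked in the issue description) looks roughly like the fragment below. This is a paraphrase of the v1alpha2-era reconcileInfrastructure flow rather than an exact copy; external.IsReady and the message wording are assumed from that context:

```go
// Paraphrase, not an exact copy: once reconcileExternal has run, the only
// handling for a not-ready infra object is a requeue. Neither the Machine
// phase nor the error fields are touched.
ready, err := external.IsReady(infraConfig)
if err != nil {
	return err
}
if !ready {
	return errors.Wrapf(&capierrors.RequeueAfterError{RequeueAfter: externalReadyWait},
		"Infrastructure provider for Machine %q in namespace %q is not ready, requeuing",
		m.Name, m.Namespace)
}
m.Status.InfrastructureReady = true
```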

@ncdc ncdc added the priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. label Oct 21, 2019
@xrmzju (Member, Author) commented Oct 21, 2019


Sorry, I might not have expressed myself clearly. If the infra machine turns into a permanent failure condition, the controller on the infra side should set errMsg/errReason; I agree with that. But if the infra machine is recoverable (e.g. a reboot), the controller on the infra side (in my implementation) would just flip the infra machine from ready to not ready, and the machine controller in cluster-api would not sense it but would stay in the Running state.

@xrmzju (Member, Author) commented Oct 21, 2019

I discussed this situation with @detiber on Slack, and he agreed with me too.
(screenshot of the Slack discussion)

@ncdc (Contributor) commented Oct 21, 2019

Ok, what you're pointing out makes sense. My apologies for being slightly confused 😄.

@ncdc ncdc added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. labels Oct 21, 2019
@ncdc (Contributor) commented Oct 21, 2019

/help

@k8s-ci-robot (Contributor)

@ncdc:
This request has been marked as needing help from a contributor.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

/help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. label Oct 21, 2019
@vincepri (Member)

For context, in v1alpha2 we chose not to recover from a failure case to avoid loops or going back and forth between the states outlined in the machine proposal. This behavior aligns with the assumption that Machines are immutable: if you get a failure, the action the system expects an operator to take is to delete and recreate the Machine.

@detiber (Member) commented Oct 21, 2019

@vincepri There are currently plenty of recoverable errors that we encounter today; we just don't necessarily bubble them up in any way. For example, if the AWS API is unavailable for any reason.

I don't think we should say that any error should be unrecoverable, nor should we say that we should resolve errors in the controllers. I think the case that is more grey (and potentially where this issue comes into play) is how to bubble up recoverable errors (that are resolved on the cloud provider side) to users without turning them into non-recoverable errors (ErrorMessage/ErrorReason).

In the specific case of Reboot, if an instance is triggered into a 'Reboot' state by the cloud provider somehow and we don't need to take remedial action to resolve it, then why shouldn't we allow it to recover gracefully?

@vincepri (Member) commented Oct 21, 2019

Yes, we need to distinguish (somehow) between controller-retryable errors and failures.

In the specific case of Reboot, if an instance is triggered into a 'Reboot' state by the cloud provider somehow and we don't need to take remedial action to resolve it, then why shouldn't we allow it to recover gracefully?

It'll be up to the infrastructure provider to make sure a machine that's being rebooted isn't marked as failed. From CAPI's perspective, a Machine that's being rebooted will cause the Node to become not ready and fall from the Running state to Provisioned.

Never mind: per https://github.com/kubernetes-sigs/cluster-api/blob/master/controllers/machine_controller_phases.go#L60-L63, the code actually doesn't check for node readiness.
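A check of that kind could look roughly like the sketch below. This is illustrative only, not the actual cluster-api code; isNodeReady is a made-up helper, and the client passed in would have to target the workload cluster where the Node lives:

```go
// Sketch: resolve the Machine's NodeRef and report whether the Node is Ready.
// Illustrative only; not the actual cluster-api implementation.
package nodecheck

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// isNodeReady returns true when the referenced Node reports a Ready=True
// condition. c must be a client for the workload cluster containing the Node.
func isNodeReady(ctx context.Context, c client.Client, nodeRef *corev1.ObjectReference) (bool, error) {
	if nodeRef == nil {
		// No NodeRef yet: the Machine has not become a Node.
		return false, nil
	}
	node := &corev1.Node{}
	if err := c.Get(ctx, types.NamespacedName{Name: nodeRef.Name}, node); err != nil {
		return false, err
	}
	for _, cond := range node.Status.Conditions {
		if cond.Type == corev1.NodeReady {
			return cond.Status == corev1.ConditionTrue, nil
		}
	}
	return false, nil
}
```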

@ncdc (Contributor) commented Oct 23, 2019

@vincepri to look up what we wrote in the v1a2 proposal around ready->notReady transitions.

@ncdc ncdc added this to the Next milestone Oct 23, 2019
@detiber (Member) commented Oct 23, 2019

One thing to note: I don't think we necessarily need to reset the Machine Phase back to a previous state; we could potentially introduce a new Phase that could be used to indicate formerly ready but not currently ready.

@vincepri (Member)

look up what we wrote in the v1a2 proposal around ready->notReady transitions.

If we allow flipping from ready -> notReady, the phase would go back to Provisioning, as outlined in the logic here:

if m.Status.BootstrapReady && !m.Status.InfrastructureReady {

I'm not super opposed to this behavior, but I'm not in total favor either. I'd rather have the Machine controller detect whether the linked Node is in a ready state by checking the NodeRef, and flip the status to Provisioned. This is in line with the state conditions outlined in the original proposal: https://github.com/kubernetes-sigs/cluster-api/blob/master/docs/proposals/20190610-machine-states-preboot-bootstrapping.md#transition-conditions-3.
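To make the mapping being discussed concrete, here is a small illustrative sketch with plain booleans standing in for the Machine status fields; computePhase is a made-up helper, not the actual reconcilePhase code:

```go
package main

import "fmt"

// computePhase sketches the phase mapping under discussion. Illustrative only.
func computePhase(bootstrapReady, infraReady, hasNodeRef, nodeReady bool) string {
	switch {
	case bootstrapReady && !infraReady:
		// The linked logic: infra flipped back to notReady -> Provisioning.
		return "Provisioning"
	case infraReady && hasNodeRef && !nodeReady:
		// The proposal above: NodeRef exists but the Node is not ready -> Provisioned.
		return "Provisioned"
	case infraReady && hasNodeRef && nodeReady:
		return "Running"
	default:
		return "Pending"
	}
}

func main() {
	// A Machine whose Node just went not-ready would be reported as Provisioned.
	fmt.Println(computePhase(true, true, true, false))
}
```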

One thing to note, I don't think we necessarily need to reset the Machine Phase back to a previous state, we could potentially introduce a new Phase that could be used to indicate formerly ready but not currently ready.

I'd like to avoid, at least for now, keeping track of previous states or introducing something like Conditions to understand how we arrived at a state. That would need to be a separate CAEP, wdyt?

@ncdc (Contributor) commented Oct 23, 2019

👍 for checking the node ref and flipping it back to provisioned for the time being, and doing a separate proposal around conditions (hint hint @timothysc 😄)

@nader-ziada (Contributor)

/assign

I'll work on this issue if nobody has any concerns.

@michaelgugino (Contributor)

I'm wholly against reversing machine states. The flow of the machine-controller and associated infra controller should be one-way.

If some machine that was previously ready (e.g. it became a node) has now failed, we'll see that indication on the node, and we remediate:

  1. Delete the machine-object
  2. Machine-controller cordons and drains the node
  3. Backing instance is deleted from cloud
  4. MachineSet or other thing creates a replacement machine

Immutable infrastructure.

@detiber (Member) commented Oct 31, 2019

For me this is less about reversing states as much as giving users visibility into the underlying resources we are managing.

From a user's point of view (not taking automated remediation into account for the moment): if I'm looking at Cluster API and it is telling me that my Machine is "Ready" (as a proxy for both the Node being present and the underlying infrastructure being in a running state), but the underlying infra is not actually in a running state, I no longer have a sense of what the issue is or how to start remediating it.

I don't think we should be thinking of this from the perspective of reversing the state machine as much as providing visibility into the actual state of the world wrt the resources we are managing for the user.

From a troubleshooting perspective, I'd like to ideally get signal from the Cluster, which gives me enough information to follow the breadcrumbs until I get to the actual problematic resource.

@michaelgugino (Contributor)

underlying resources we are managing.

A node is an overlying resource relative to a machine. That's where a user's focus should be: the health of nodes. If we're talking about disposable, immutable machines, then when the node goes unready/unhealthy, you get rid of it. You don't have to care what the layer underneath is doing; that's the entire point.

If I'm looking at Cluster API and it is telling me that my Machine is "Ready"

Yeah, "Ready" doesn't make sense for a machine. Probably the terminal state should be 'Complete' after it becomes a node.

I'd like to ideally get signal from the Cluster

Then we should sync some aggregate health information from the nodes to the cluster object. Machines are just an implementation detail. There's also room for syncing some info about machines that haven't received a nodeRef after X time.

@detiber (Member) commented Oct 31, 2019

Then we should sync some aggregate health information from the nodes to the cluster object. Machines are just an implementation detail. There's also room for syncing some info about machines that haven't received a nodeRef after X time.

I think it's more than just the nodes, though. If we expect someone to take manual remediation steps, we need to point them to the resource that is causing the trouble; otherwise we are requiring them to know all the implementation details of how cluster-api and providers work before they can even start to troubleshoot and fix the issue.

@michaelgugino (Contributor)


Remediation always needs to start at the node, automated or otherwise. If you have an unhealthy node, you delete the backing machine-object. If the resolution is any more complicated than that, they'll need to know those provider details either way.

We could build logic into the machine-controller to delete an associated machine if the node object is deleted, but there are other tradeoffs with that (e.g. users might not drain the node first, etc).

Let's take Kubernetes out of the equation: say you have an application deployed across 3 instances in the cloud, behind a load balancer. The load balancer has health checks; it detects whether or not the application is up, not the backing instance. You could build automation to watch which endpoints are healthy in the load balancer, and replace those instances when they fail.

Back to Kubernetes: in our case, the application we care about is the node. Our application can go down for a variety of reasons, but that's our primary signal. If the instance dies, the application dies with it. If the application can't start due to corruption, the instance might be alive and well as far as the cloud provider is concerned. It doesn't make any sense to label the cloud VM itself with the node's status, thus it doesn't make any sense to label the machine-object with the node's status.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 29, 2020
@vincepri (Member)

I think this has been fixed on master / v1alpha3. We reconcile the ready value whenever it changes; can this be closed?

/cc @ncdc

@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 28, 2020
@vincepri (Member) commented Mar 2, 2020

Closing, fixed in v1alpha3
/close

@k8s-ci-robot (Contributor)

@vincepri: Closing this issue.

In response to this:

Closing, fixed in v1alpha3
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
