
machine controller should sense the event that infra obj turns into notReady #1622

Closed
xrmzju opened this issue Oct 20, 2019 · 24 comments
Labels
help wanted: Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines.
kind/bug: Categorizes issue or PR as related to a bug.
lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed.
priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.

Comments

@xrmzju (Member) commented Oct 20, 2019

If the infra obj turns from ready to notReady and has not been patched with any ErrorMsg/ErrReason, we only requeue the machine obj and it stays in the Running state.

return errors.Wrapf(&capierrors.RequeueAfterError{RequeueAfter: externalReadyWait},

If the infra machine is recoverable (e.g. a reboot), I don't think the infra controller should set any ErrorMsg/ErrReason on the infra resource.
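To make the scenario concrete, here is a minimal sketch, using a hypothetical FooMachine infra type rather than any real provider, of an infra controller that flips Ready back to false for a recoverable condition without setting ErrorMsg/ErrReason:

```go
// Minimal sketch only. FooMachine is a hypothetical infra type, not a real
// provider: on a recoverable condition (e.g. a reboot) the controller flips
// Ready back to false and deliberately leaves the error fields unset, because
// the condition is expected to clear on its own.
package main

import "fmt"

// FooMachineStatus mirrors the status fields the machine controller inspects
// on the infra object (field names assumed for illustration).
type FooMachineStatus struct {
	Ready        bool
	ErrorReason  *string
	ErrorMessage *string
}

func reconcileFooMachine(status *FooMachineStatus, instanceState string) {
	switch instanceState {
	case "running":
		status.Ready = true
	case "rebooting":
		// Recoverable: report notReady, but do not mark the resource as failed.
		status.Ready = false
	default:
		// Truly terminal states would set ErrorReason/ErrorMessage here.
	}
}

func main() {
	s := &FooMachineStatus{Ready: true}
	reconcileFooMachine(s, "rebooting")
	// The machine controller only sees ready=false; with no error fields set it
	// just requeues and the Machine stays in the Running phase.
	fmt.Printf("ready=%v errorReason=%v errorMessage=%v\n", s.Ready, s.ErrorReason, s.ErrorMessage)
}
```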

/kind bug

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Oct 20, 2019
@ncdc (Contributor) commented Oct 21, 2019

I believe this is handled in a different place. The reconcile flow is:

  1. r.reconcileInfrastructure(ctx, m)
  2. infraConfig, err := r.reconcileExternal(ctx, m, &m.Spec.InfrastructureRef)
  3. The error-propagation block inside reconcileExternal:

     errorReason, errorMessage, err := external.ErrorsFrom(obj)
     if err != nil {
         return nil, err
     }
     if errorReason != "" {
         machineStatusError := capierrors.MachineStatusError(errorReason)
         m.Status.ErrorReason = &machineStatusError
     }
     if errorMessage != "" {
         m.Status.ErrorMessage = pointer.StringPtr(
             fmt.Sprintf("Failure detected from referenced resource %v with name %q: %s",
                 obj.GroupVersionKind(), obj.GetName(), errorMessage),
         )
     }

At this point, the machine's .status.errorReason and/or .status.errorMessage have been populated if the infra resource was in error. This happens regardless of the value of the infra resource's .status.ready field. The line you linked above happens later.

However, as best I can tell, once the machine's .status.errorReason and/or .status.errorMessage are set, the code does not provide any way to clear them. But I believe this is the correct behavior. If a machine is in error, this is a permanent failure condition.
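For reference, the later ready check mentioned above (the line linked in the issue description) looks roughly like the fragment below. This is a paraphrase of the v1alpha2-era reconcileInfrastructure flow rather than an exact copy; external.IsReady and the message wording are assumed from that context:

```go
// Paraphrase, not an exact copy: once reconcileExternal has run, the only
// handling for a not-ready infra object is a requeue. Neither the Machine
// phase nor the error fields are touched.
ready, err := external.IsReady(infraConfig)
if err != nil {
	return err
}
if !ready {
	return errors.Wrapf(&capierrors.RequeueAfterError{RequeueAfter: externalReadyWait},
		"Infrastructure provider for Machine %q in namespace %q is not ready, requeuing",
		m.Name, m.Namespace)
}
m.Status.InfrastructureReady = true
```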

@ncdc ncdc added the priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. label Oct 21, 2019
@xrmzju (Member, Author) commented Oct 21, 2019


Sorry, I might not have expressed myself clearly. If the infra machine turns into a permanent failure condition, the controller on the infra side should set errMsg/errReason; I agree with that. But if the infra machine is recoverable (e.g. a reboot), the controller on the infra side (in my implementation) would just flip the infra machine from ready to not ready, and the machine controller in cluster-api would not sense it but would stay in the Running state.

@xrmzju (Member, Author) commented Oct 21, 2019

I discussed this situation with @detiber on Slack, and he agreed with me too.
(screenshot of the Slack discussion)

@ncdc (Contributor) commented Oct 21, 2019

Ok, what you're pointing out makes sense. My apologies for being slightly confused 😄.

@ncdc ncdc added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. labels Oct 21, 2019
@ncdc (Contributor) commented Oct 21, 2019

/help

@k8s-ci-robot (Contributor)

@ncdc:
This request has been marked as needing help from a contributor.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

/help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. label Oct 21, 2019
@vincepri (Member)

For context, in v1alpha2 we chose not to recover from a failure case to avoid loops or going back and forth between the states outlined in the machine proposal. This behavior aligns with the assumption that Machines are immutable: if you get a failure, the action the system expects an operator to take is to delete and recreate the Machine.

@detiber (Member) commented Oct 21, 2019

@vincepri There are currently plenty of recoverable errors that we encounter today; we just don't necessarily bubble them up in any way. For example, if the AWS API is unavailable for any reason.

I don't think we should say that any error should be unrecoverable, nor should we say that we should resolve errors in the controllers. I think the case that is more grey (and potentially where this issue comes into play) is how to bubble up recoverable errors (that are resolved on the cloud provider side) to users without turning them into non-recoverable errors (ErrorMessage/ErrorReason).

In the specific case of Reboot, if an instance is triggered into a 'Reboot' state by the cloud provider somehow and we don't need to take remedial action to resolve it, then why shouldn't we allow it to recover gracefully?

@vincepri (Member) commented Oct 21, 2019

Yes, we need to distinguish (somehow) between controller-retryable errors and failures.

In the specific case of Reboot, if an instance is triggered into a 'Reboot' state by the cloud provider somehow and we don't need to take remedial action to resolve it, then why shouldn't we allow it to recover gracefully?

It'll be up to the infrastructure provider to make sure a machine that's being rebooted isn't marked as failed. From CAPI's perspective, a Machine that's being rebooted will cause the Node to become not ready and fall from the Running state to Provisioned.

Never mind: per https://github.com/kubernetes-sigs/cluster-api/blob/master/controllers/machine_controller_phases.go#L60-L63, the code actually doesn't check for node readiness.
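A check of that kind could look roughly like the sketch below. This is illustrative only, not the actual cluster-api code; isNodeReady is a made-up helper, and the client passed in would have to target the workload cluster where the Node lives:

```go
// Sketch: resolve the Machine's NodeRef and report whether the Node is Ready.
// Illustrative only; not the actual cluster-api implementation.
package nodecheck

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// isNodeReady returns true when the referenced Node reports a Ready=True
// condition. c must be a client for the workload cluster containing the Node.
func isNodeReady(ctx context.Context, c client.Client, nodeRef *corev1.ObjectReference) (bool, error) {
	if nodeRef == nil {
		// No NodeRef yet: the Machine has not become a Node.
		return false, nil
	}
	node := &corev1.Node{}
	if err := c.Get(ctx, types.NamespacedName{Name: nodeRef.Name}, node); err != nil {
		return false, err
	}
	for _, cond := range node.Status.Conditions {
		if cond.Type == corev1.NodeReady {
			return cond.Status == corev1.ConditionTrue, nil
		}
	}
	return false, nil
}
```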

@ncdc (Contributor) commented Oct 23, 2019

@vincepri to look up what we wrote in the v1a2 proposal around ready->notReady transitions.

@ncdc ncdc added this to the Next milestone Oct 23, 2019
@detiber (Member) commented Oct 23, 2019

One thing to note: I don't think we necessarily need to reset the Machine Phase back to a previous state; we could potentially introduce a new Phase that could be used to indicate formerly ready but not currently ready.

@vincepri (Member)

look up what we wrote in the v1a2 proposal around ready->notReady transitions.

If we allow flipping from ready -> notReady, the phase would go back to Provisioning, as outlined in the logic here:

if m.Status.BootstrapReady && !m.Status.InfrastructureReady {

I'm not super opposed to this behavior, but I'm not in total favor either. I'd rather have the Machine controller detect whether the linked Node is in a ready state by checking the NodeRef, and flip the status to Provisioned. This is in line with the state conditions outlined in the original proposal: https://github.com/kubernetes-sigs/cluster-api/blob/master/docs/proposals/20190610-machine-states-preboot-bootstrapping.md#transition-conditions-3.
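To make the mapping being discussed concrete, here is a small illustrative sketch with plain booleans standing in for the Machine status fields; computePhase is a made-up helper, not the actual reconcilePhase code:

```go
package main

import "fmt"

// computePhase sketches the phase mapping under discussion. Illustrative only.
func computePhase(bootstrapReady, infraReady, hasNodeRef, nodeReady bool) string {
	switch {
	case bootstrapReady && !infraReady:
		// The linked logic: infra flipped back to notReady -> Provisioning.
		return "Provisioning"
	case infraReady && hasNodeRef && !nodeReady:
		// The proposal above: NodeRef exists but the Node is not ready -> Provisioned.
		return "Provisioned"
	case infraReady && hasNodeRef && nodeReady:
		return "Running"
	default:
		return "Pending"
	}
}

func main() {
	// A Machine whose Node just went not-ready would be reported as Provisioned.
	fmt.Println(computePhase(true, true, true, false))
}
```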

One thing to note, I don't think we necessarily need to reset the Machine Phase back to a previous state, we could potentially introduce a new Phase that could be used to indicate formerly ready but not currently ready.

I'd like to avoid, at least for now, keeping track of previous states or introducing something like Conditions to understand how we arrived at a state. That would need to be a separate CAEP, wdyt?

@ncdc (Contributor) commented Oct 23, 2019

👍 for checking the node ref and flipping it back to provisioned for the time being, and doing a separate proposal around conditions (hint hint @timothysc 😄)

@nader-ziada (Contributor)

/assign

I'll work on this issue if nobody has any concerns.

@michaelgugino (Contributor)

I'm wholly against reversing machine states. The flow of the machine-controller and associated infra controller should be one-way.

If some machine that was previously ready (e.g. it became a node) has now failed, we'll see that indication on the node, and we remediate:

  1. Delete the machine-object
  2. Machine-controller cordons and drains the node
  3. Backing instance is deleted from cloud
  4. MachineSet or other thing creates a replacement machine

Immutable infrastructure.

@detiber (Member) commented Oct 31, 2019

For me this is less about reversing states as much as giving users visibility into the underlying resources we are managing.

From a user's point of view (not taking automated remediation into account for the moment): if I'm looking at Cluster API and it is telling me that my Machine is "Ready" (as a proxy for both the Node being present and the underlying infrastructure being in a running state), but the underlying infra is not actually in a running state, I no longer have a sense of what the issue is or how to start remediating it.

I don't think we should be thinking of this from the perspective of reversing the state machine as much as providing visibility into the actual state of the world wrt the resources we are managing for the user.

From a troubleshooting perspective, I'd like to ideally get signal from the Cluster, which gives me enough information to follow the breadcrumbs until I get to the actual problematic resource.

@michaelgugino (Contributor)

underlying resources we are managing.

A node is an overlying resource relative to a machine. That's where a user's focus should be: the health of nodes. If we're talking about disposable, immutable machines, then when the node goes unready/unhealthy, you get rid of it. You don't have to care what the layer underneath is doing; that's the entire point.

If I'm looking at Cluster API and it is telling me that my Machine is "Ready"

Yeah, "Ready" doesn't make sense for a machine. Probably the terminal state should be 'Complete' after it becomes a node.

I'd like to ideally get signal from the Cluster

Then we should sync some aggregate health information from the nodes to the cluster object. Machines are just an implementation detail. There's also room for syncing some info about machines that haven't received a nodeRef after X time.

@detiber (Member) commented Oct 31, 2019

Then we should sync some aggregate health information from the nodes to the cluster object. Machines are just an implementation detail. There's also room for syncing some info about machines that haven't received a nodeRef after X time.

I think it's more than just the nodes, though. If we expect someone to take manual remediation steps, we need to point them to the resource that is causing the trouble; otherwise we are requiring them to know all the implementation details of how cluster-api and providers work before they can even start to troubleshoot and fix the issue.

@michaelgugino (Contributor)


Remediation always needs to start at the node, automated or otherwise. If you have an unhealthy node, you delete the backing machine-object. If the resolution is any more complicated than that, they'll need to know those provider details either way.

We could build logic into the machine-controller to delete an associated machine if the node object is deleted, but there are other tradeoffs with that (e.g. users might not drain the node first, etc).

Let's take Kubernetes out of the equation: say you have an application deployed across 3 instances in the cloud, behind a load balancer. The load balancer has health checks; it detects whether or not the application is up, not the backing instance. You could build automation to watch which endpoints are healthy in the load balancer, and replace those instances when they fail.

Back to Kubernetes: in our case, the application we care about is the node. Our application can go down for a variety of reasons, but that's our primary signal. If the instance dies, the application dies with it. If the application can't start due to corruption, the instance might be alive and well as far as the cloud provider is concerned. It doesn't make any sense to label the cloud VM itself with the node's status, thus it doesn't make any sense to label the machine-object with the node's status.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 29, 2020
@vincepri (Member)

I think this has been fixed on master / v1alpha3. We reconcile the ready value whenever it changes; can this be closed?

/cc @ncdc

@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 28, 2020
@vincepri (Member) commented Mar 2, 2020

Closing, fixed in v1alpha3
/close

@k8s-ci-robot (Contributor)

@vincepri: Closing this issue.

In response to this:

Closing, fixed in v1alpha3
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
