Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🐛 Update Machine Status with Error Reason and Error Message if infra object is deleted #1458

Merged

Conversation

kashifest
Copy link
Contributor

What this PR does / why we need it:
Updates the machine status with error reason and error message when machine is in running phase and the infrastructure object is deleted
Which issue(s) this PR fixes (optional, in fixes #(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #1398

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Sep 27, 2019
@k8s-ci-robot
Copy link
Contributor

Hi @kashifest. Thanks for your PR.

I'm waiting for a kubernetes-sigs or kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Sep 27, 2019
@kashifest
Copy link
Contributor Author

/assign @ncdc

@@ -78,6 +78,13 @@ func (r *MachineReconciler) reconcileExternal(ctx context.Context, m *clusterv1.
obj, err := external.Get(r.Client, ref, m.Namespace)
if err != nil {
if apierrors.IsNotFound(err) {
if m.Status.InfrastructureReady && m.Status.GetTypedPhase() == clusterv1.MachinePhaseRunning {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shouldn't have logic comparing against the Machine "Phase", it is meant to only be a user friendly display value.

This check should also probably be pushed up to where reconcileExternal is called for the InfrastructureRef, since a missing Bootstrap.ConfigRef should not result in an error.

Copy link
Contributor Author

@kashifest kashifest Sep 27, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked the machine phase to make sure this block of code is executed only once. If it is not recommended, I can remove that

Copy link
Contributor

@ncdc ncdc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once we get it sorted where this logic should live, could you please make sure to add test case(s) for it? We have e.g. TestReconcileInfrastructure and TestReconcileBootstrap.

@@ -78,6 +78,13 @@ func (r *MachineReconciler) reconcileExternal(ctx context.Context, m *clusterv1.
obj, err := external.Get(r.Client, ref, m.Namespace)
if err != nil {
if apierrors.IsNotFound(err) {
if m.Status.InfrastructureReady && m.Status.GetTypedPhase() == clusterv1.MachinePhaseRunning {
// Infra object went missing after the machine was up and running
klog.Infof("Infrastructure object %v %q went missing. Updating Machine Status", ref.GroupVersionKind(), ref.Name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't want to add any new raw klog calls. Please use r.Log instead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

@@ -78,6 +78,13 @@ func (r *MachineReconciler) reconcileExternal(ctx context.Context, m *clusterv1.
obj, err := external.Get(r.Client, ref, m.Namespace)
if err != nil {
if apierrors.IsNotFound(err) {
if m.Status.InfrastructureReady && m.Status.GetTypedPhase() == clusterv1.MachinePhaseRunning {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should not check the phase - that is for human consumption. Checking InfrastructureReady is sufficient.

If we place this logic here, it applies to all external references - both the bootstrap ref and the infrastructure ref. @vincepri do you this is appropriate?

Copy link
Contributor Author

@kashifest kashifest Sep 27, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked the machine phase to make sure this block of code is executed only once. If it is not recommended, I can remove that

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are not currently short-circuiting machine reconciliation if the error reason/error message are set, we need to be doing that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ncdc I am not sure if I understood correctly, this check was allowing to setting the error reason/message and logging only once, but it did not short-circuit reconciliation

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In what situation would it be executing this more than once?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In each call to reconcileExternal in my current version of the code?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, once you move this to reconcileInfrastructure, I think we also need to add code toward the top of Reconcile that checks machine.status.errorReason/errorMessage, and if either is set, it stops. This would need to be before the call to patch.NewHelper. Do you agree @vincepri?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the short-circuit behavior is definitely missing. Let's follow-up with a different PR marked with ⚠️ for that

controllers/machine_controller_phases.go Outdated Show resolved Hide resolved
controllers/machine_controller_phases.go Outdated Show resolved Hide resolved
if m.Status.InfrastructureReady && m.Status.GetTypedPhase() == clusterv1.MachinePhaseRunning {
// Infra object went missing after the machine was up and running
klog.Infof("Infrastructure object %v %q went missing. Updating Machine Status", ref.GroupVersionKind(), ref.Name)
machineStatusError := capierrors.MachineStatusError("Invalid Configuration")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vincepri do we have any guidance on the content and format for error reasons?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should avoid creating new ErrorReason(s) dynamically. We can add new ones to the constants defined in the package. Let's focus on having a good error message in ErrorMessage though

klog.Infof("Infrastructure object %v %q went missing. Updating Machine Status", ref.GroupVersionKind(), ref.Name)
machineStatusError := capierrors.MachineStatusError("Invalid Configuration")
m.Status.ErrorReason = &machineStatusError
m.Status.ErrorMessage = pointer.StringPtr("Invalid Configuration")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be descriptive and include the GVK (GroupVersionKind) and name of the resource that went missing.

Copy link
Contributor Author

@kashifest kashifest Sep 27, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok.

@kashifest
Copy link
Contributor Author

@ncdc I can add the test cases also. Now I am waiting to know where should I put the check in reconcileExternal or reoncileInfrastructure

@ncdc
Copy link
Contributor

ncdc commented Sep 27, 2019

+1 to reconcileInfrastructure

@vincepri
Copy link
Member

Before merging, let's find a PR title that communicates better the changes we're making. PR titles go in release notes for users to read

@vincepri
Copy link
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Sep 27, 2019
@kashifest kashifest changed the title 🐛 Update Machine Status if infra object is deleted 🐛 Update Machine Status with Error Reason and Error Message if infra object is deleted Sep 27, 2019
@kashifest kashifest force-pushed the bugfix/infraObject-deletion-Kashif branch 2 times, most recently from 29f3724 to 166cc55 Compare September 27, 2019 21:03
@kashifest
Copy link
Contributor Author

@ncdc @vincepri @detiber I have addressed most of the issues commented above. I did not move the code to reconcileInfrastructure though, since reconcileExternal changes the error cause for reconciliation. This makes it a bit tricky to get the actual error in ```reconcileInfrastructure``.

@kashifest kashifest changed the title 🐛 Update Machine Status with Error Reason and Error Message if infra object is deleted 🐛 Update Machine Status with Error Reason and Error Message if infra/bootstrap object is deleted Sep 27, 2019
@kashifest kashifest force-pushed the bugfix/infraObject-deletion-Kashif branch 3 times, most recently from b779cfb to 0a59225 Compare September 30, 2019 14:18
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Sep 30, 2019
@kashifest kashifest changed the title 🐛 Update Machine Status with Error Reason and Error Message if infra/bootstrap object is deleted 🐛 Update Machine Status with Error Reason and Error Message if infra object is deleted Sep 30, 2019
@@ -197,6 +197,14 @@ func (r *MachineReconciler) reconcileInfrastructure(ctx context.Context, m *clus
if infraConfig == nil && err == nil {
return nil
} else if err != nil {
if m.Status.InfrastructureReady {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should probably also test apierrors.IsNotFound(errors.Cause(err)), to ensure that the error is related to the resource being missing rather than another type of error.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried that. It seems reconcileExternal is consuming that error and only returning the requeue error

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Arrgh, indeed it is... I'm wondering if we need to refactor reconcileExternal a bit so that we are not returning a wrapped RequeueAfter error there.... It would probably require registering the watcher for the resource prior to actually retrieving one, though... @vincepri @ncdc thoughts?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could either move the requeue error up to the calling functions, or (less preferred, but quicker) string match "not found" on the incoming error

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vincepri @detiber I chose the quicker but not preferred way ... the placement of the requeue error in calling function is bit of a design issue since it also requires change in reconcileBootstrap , which I thought is bit out of scope of this bug, shall we follow this in another PR?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm good with this as a short term solution as long as a tracking issue is created to clean up later.

controllers/machine_controller_phases.go Outdated Show resolved Hide resolved
controllers/machine_controller_phases.go Outdated Show resolved Hide resolved
controllers/machine_controller_phases.go Outdated Show resolved Hide resolved
controllers/machine_controller_phases.go Outdated Show resolved Hide resolved
@kashifest kashifest force-pushed the bugfix/infraObject-deletion-Kashif branch from 0a59225 to 160b664 Compare September 30, 2019 18:55
// Infra object went missing after the machine was up and running
r.Log.Error(err, "Machine infrastructure reference has been deleted after being ready, setting failure state")
machineStatusError := capierrors.InvalidConfigurationMachineError
m.Status.ErrorReason = &machineStatusError
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
m.Status.ErrorReason = &machineStatusError
m.Status.ErrorReason = capierrors.ClusterStatusErrorPtr(capierrors.InvalidConfigurationMachineError)

Copy link
Contributor Author

@kashifest kashifest Sep 30, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vincepri Thanks. Didn't know this exist. Shouldn't it be MachineStatusErrorPtr instead of ClusterStatusErrorPtr

Copy link
Member

@vincepri vincepri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve
/assign @detiber

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kashifest, vincepri

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 30, 2019
@detiber
Copy link
Member

detiber commented Sep 30, 2019

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 30, 2019
@k8s-ci-robot k8s-ci-robot merged commit 3046e2f into kubernetes-sigs:master Sep 30, 2019
@kashifest kashifest deleted the bugfix/infraObject-deletion-Kashif branch October 1, 2019 06:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

machine should not stay in running phase after the infra obj been deleted
5 participants