baremetal: improve debuggability of ipi deployments #328

stbenjam · 2020-05-15T17:22:00Z

The goal of this enhancement is to improve the day 1 install experience
and reduce the perception of complexity in baremetal IPI deployments.

openshift-ci-robot · 2020-05-15T17:22:16Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: stbenjam
To complete the pull request process, please assign enxebre
You can assign the PR to them by writing /assign @enxebre in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

stbenjam · 2020-05-15T17:37:21Z

@dtantsur @hardys @juliakreger @markmc @sadasu Please have a look when you have a moment.

stbenjam · 2020-05-15T17:38:54Z

@abhinavdahiya I know you've been doing some work on this topic already for the general case, would appreciate your thoughts, especially about the parts that impact the installer outside of our platform.

@enxebre I'd appreciate your thoughts as well, especially in regards to worker deployment and how we can get information about failures to the installer.

abhinavdahiya · 2020-05-15T17:47:29Z

enhancements/baremetal/debuggability-of-baremetal-ipi-deployment.md

+When this happens, the installer times out, reports to the user a large
+number of operators failed to roll out, and no useful context about what
+to do or why the operators failed.


That's on the operator owners to make sure the errors are clear. Keep in mind that the installer is not the only consumer of these messages; admins also see them during upgrades.

So personally, I think the goal should be to ensure that each operator is responsible for using clear error messages in its status.

I agree, it's just that on day 1 the worker deployment failure seems special to me. It causes a lot of noise, as a bunch of operators start reporting error messages that make it hard to point to a root cause unless you've seen the problem before. I don't think machine-api-operator even reports anything useful when this happens, but if it did, it would get lost in the mix of the many other failing operators.

If we take the ingress/console operators as an example: if the workers fail, they will be in error, but from my experience of installing OpenShift the first few times, the user will have no idea that this is the reason. They will just see that those operators are down.
What may be possible is to have some kind of 'validators', either in the installer binary or as an operator, that can analyze logs or cluster runtime state (with minimal requirements for cluster functionality, such as passwordless ssh between nodes) and explain to the user what went wrong. If we provide an infrastructure for writing those validators, then operator owners, QE, and the integration team will be able to enhance them once they run into an issue that was hard to analyze.

If they asked for a certain number of workers and they didn't get them, it seems reasonable to have a special error for that.

I also think some generic orientation regarding how to investigate operators failing may help as well. They'll need to learn that skill eventually no matter what. So documenting how to look at an Operator's status and referencing that seems to be something worth doing no matter what. Ingress for example often tells you that the dns entry doesn't exist but people don't even know where to look for that.
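As a rough illustration of what "looking at an Operator's status" means in practice, here is a minimal Python sketch that summarizes failing operators from the JSON printed by `oc get clusteroperators -o json`. It relies only on the `Available` and `Degraded` condition types of the ClusterOperator resource. The function name, the sample operators, and all reasons and messages below are made up for the example; they are not real operator output.

```python
def failing_operators(cluster_operators):
    """Return one summary line per operator that is Degraded or not Available."""
    lines = []
    for item in cluster_operators.get("items", []):
        name = item["metadata"]["name"]
        # Index the conditions by type so they can be looked up directly.
        conditions = {
            c["type"]: c for c in item.get("status", {}).get("conditions", [])
        }
        degraded = conditions.get("Degraded", {})
        available = conditions.get("Available", {})
        if degraded.get("status") == "True":
            state, cond = "Degraded", degraded
        elif available.get("status") == "False":
            state, cond = "not Available", available
        else:
            continue  # Operator looks healthy; nothing to report.
        reason = cond.get("reason", "unknown")
        message = cond.get("message", "")
        lines.append(f"{name}: {state} ({reason}): {message}")
    return lines


# Invented sample data shaped like a ClusterOperator list.
SAMPLE_CLUSTER_OPERATORS = {
    "items": [
        {
            "metadata": {"name": "ingress"},
            "status": {"conditions": [
                {"type": "Available", "status": "False",
                 "reason": "ExampleUnavailable", "message": "example message"},
                {"type": "Degraded", "status": "True",
                 "reason": "ExampleDegraded",
                 "message": "example message: DNS record not found"},
            ]},
        },
        {
            "metadata": {"name": "console"},
            "status": {"conditions": [
                {"type": "Available", "status": "False",
                 "reason": "ExampleUnavailable",
                 "message": "example message: route not reachable"},
                {"type": "Degraded", "status": "False"},
            ]},
        },
        {
            "metadata": {"name": "etcd"},
            "status": {"conditions": [
                {"type": "Available", "status": "True"},
                {"type": "Degraded", "status": "False"},
            ]},
        },
    ]
}

for line in failing_operators(SAMPLE_CLUSTER_OPERATORS):
    print(line)
```

A summary like this does not identify a root cause, but it puts each operator's own reason and message in one place, which is where the documentation could point users first.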

abhinavdahiya · 2020-05-15T17:52:15Z

enhancements/baremetal/debuggability-of-baremetal-ipi-deployment.md

+The installer has a feature for log gathering on bootstrap failure that
+does not work on baremetal. This should be the first priority, but even
+in this case a user still needs to look into an archive containing many
+logs to identify a failure.
+
+Ideally there would be some mechanism to identify and extract useful
+information and display it to the user.


openshift/installer#2569
^^ is already looking at making these problems easier to report in the long term.

For now, the installer has a list of common failures and how to identify them in https://github.com/openshift/installer/blob/master/docs/user/troubleshootingbootstrap.md#common-failures
The goal is to curate a list of detectable failures and then detect them automatically as part of the analysis.

The initial approach in 2569 was to show users most of the failure logs from the bundle and let them decide for themselves, but personally I would like us to come up with a list of common known failures and then just show: this was the error, and here's how you might resolve it.
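A curated known-failure list could be sketched roughly as follows in Python. This is only an illustration of the idea, not how openshift/installer#2569 implements it. The `KnownFailure` type, the `diagnose` function, the failure names, the regular expressions, and the advice strings are all hypothetical placeholders.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class KnownFailure:
    name: str            # short identifier for the failure
    pattern: re.Pattern  # regex that recognizes the failure in a log line
    advice: str          # what the user might try next


# Hypothetical catalogue; real entries would come from the curated list.
KNOWN_FAILURES = [
    KnownFailure(
        "image-pull",
        re.compile(r"failed to pull image", re.IGNORECASE),
        "Check the pull secret and registry connectivity.",
    ),
    KnownFailure(
        "api-unreachable",
        re.compile(r"connection refused", re.IGNORECASE),
        "Check that the API endpoint is reachable from the bootstrap host.",
    ),
]


def diagnose(log_lines):
    """Return (failure, first matching line) for each known failure seen."""
    findings = []
    seen = set()
    for line in log_lines:
        for failure in KNOWN_FAILURES:
            if failure.name not in seen and failure.pattern.search(line):
                seen.add(failure.name)
                findings.append((failure, line.strip()))
    return findings


# Invented log lines, standing in for the contents of a gathered log bundle.
logs = [
    "starting bootstrap services",
    "E0515 Failed to pull image registry.example.com/release:latest",
    "E0515 failed to pull image registry.example.com/release:latest (retry)",
]
for failure, line in diagnose(logs):
    print(f"{failure.name}: {failure.advice}\n  evidence: {line}")
```

Reporting only the first matching line per failure keeps the output short, which matches the intent of showing "this was the error, and here's how you might resolve it" instead of the raw logs.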

abhinavdahiya · 2020-05-15T17:58:32Z

enhancements/baremetal/debuggability-of-baremetal-ipi-deployment.md

+#### Infrastructure Automation (Terraform)
+
+Baremetal IPI relies on terraform to provision a libvirt bootstrap
+virtual machine, and bare metal control plane hosts. We use
+terraform-provider-libvirt and terraform-provider-ironic to accomplish
+those goals.
+
+terraform-provider-ironic reports failures when it cannot reach the
+Ironic API, or a control plane host fails to provision. In both cases,
+we do not provide useful information to the user about what to do.


I think the ironic provider should provide clear error messages. Any effort we put into this benefits both installer users and upstream.

I have some work in flight in openshift/installer#3535 (Bug 1837564: pkg/terraform: add diagnostics errors for terraform apply operations), and we could expand on that if we like.

This is exactly the combination that we need, thanks -- I think we can improve the terraform error messages for the general case, and use 3535 to add OpenShift context where appropriate.

romfreiman · 2020-05-17T05:57:21Z

General comment: let's take the assisted installer into account as well. The Ironic part is irrelevant there, of course, but other parts might apply.

sadasu · 2020-05-19T15:49:01Z

enhancements/baremetal/debuggability-of-baremetal-ipi-deployment.md

+workers, it is currently only shown on the `BareMetalHost` resource or
+by examining baremetal-operator logs. Failure to deploy a worker should
+be reflected by marking either the `machine-api-operator` or the future
+`cluster-baremetal-operator` degraded.


We have to make sure this behavior is similar to the behavior on other platforms. Should the MAO or CBO be marked degraded when 1 worker fails to deploy? What if that is an issue with the worker itself? How can we distinguish between errors in the resource being provisioned versus an issue in the control plane?

These are all very good questions! It would be helpful to understand from someone in MAO (maybe @enxebre) how it works for other platforms, but my understanding is that MAO doesn't go degraded in this case.

I think that MAO should show degraded if replicas are not met, or at the very least, if replicas are < 2, since we know we need 2 to get a working cluster on day 1 (unless the control plane is schedulable).

Perhaps cluster-baremetal-operator should also go degraded if provisioning fails, with more specific error messages bubbled up from baremetal-operator/ironic.

After a discussion with the MAO team this is what I learnt. The MAO would go into a degraded state only when the pods that it is responsible for deploying, fail to come up. When a resource it manages, in this case a worker Machine, does not come up, the MAO does not go into a failed/degraded state for other platforms and should probably be the same for baremetal too.
When the initial deployment with 2 workers fails, it should be considered an Installer error, and we should provide the most detailed error messages we can by bubbling up what we can get from BMO and/or Ironic. The MAO team believes that this is not a reason to put the operator in a degraded state.
Since the baremetal platform is special, we could come up with a semantic on day 2 where if we notice a large number of worker failures (for example 90% of workers are not coming up), then the aggregated bad state could result in the operator being put in the degraded state. Currently, the operator does not have an aggregated view, so that needs to be added to the SLO at a future time.
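To make the thresholds floated so far concrete (fewer than 2 workers on day 1, or roughly 90% of workers failing on day 2), here is a minimal Python sketch of such a rule. It is not existing MAO behavior; as noted above, the operator has no aggregated view today. The function name, parameters, and defaults are hypothetical and simply mirror the numbers mentioned in this thread.

```python
def workers_degraded(requested, ready, day1,
                     min_day1_workers=2, day2_failure_percent=90):
    """Decide whether worker provisioning should count as degraded.

    requested: number of worker replicas asked for
    ready:     number of workers that actually joined the cluster
    day1:      True during initial install, False afterwards
    """
    if requested <= 0:
        return False  # No workers requested, so nothing can be missing.
    if day1:
        # Day 1: a viable cluster needs at least two workers
        # (or all requested ones, if fewer than two were requested).
        return ready < min(requested, min_day1_workers)
    # Day 2: only an aggregated, large-scale failure counts.
    failed = requested - ready
    # Integer arithmetic avoids floating-point comparison surprises.
    return failed * 100 >= requested * day2_failure_percent


print(workers_degraded(requested=3, ready=1, day1=True))    # True
print(workers_degraded(requested=3, ready=2, day1=True))    # False
print(workers_degraded(requested=10, ready=1, day1=False))  # True
print(workers_degraded(requested=10, ready=2, day1=False))  # False
```

Whether a rule like this belongs in the operator or in the installer is exactly the open question in this thread; the sketch only shows that the criteria themselves are simple to express once an aggregated count is available.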

> After a discussion with the MAO team this is what I learnt. The MAO would go into a degraded state only when the pods that it is responsible for deploying, fail to come up. When a resource it manages, in this case a worker Machine, does not come up, the MAO does not go into a failed/degraded state for other platforms and should probably be the same for baremetal too.

Yea, it doesn't go to a degraded state for anyone and I think that's a mistake and what I'm proposing to change here.

> When the initial deployment with 2 workers fails, it should be considered an Installer error, and we should provide the most detailed error messages we can by bubbling up what we can get from BMO and/or Ironic. The MAO team believes that this is not a reason to put the operator in a degraded state.

The installer gets information about deployment success largely through operator states via CVO; it would need a special case to check that the workers meet the requested number of replicas.

If MAO can't provision machines, why is that not a degraded state of the operator?

CC: @abhinavdahiya

@stbenjam This documentation https://github.com/openshift/cluster-version-operator/blob/master/docs/dev/clusteroperator.md#what-should-an-operator-report-with-clusteroperator-custom-resource helped me understand the semantics behind the different Operator states.

> Failure to deploy a worker should
> be reflected by marking either the machine-api-operator or the future
> cluster-baremetal-operator degraded.

There are countless transient scenarios for "Failure to deploy a worker". This makes it impractical to put a reasonable generic semantic on top of it, so there is little value in letting the overall operator go degraded in such a heterogeneous scenario.

Instead, I believe the details of these errors belong in the conditions of the individual machine resource and any lower-level resource, just like we do for any other provider: https://github.com/openshift/cluster-api-provider-aws/blob/master/pkg/apis/awsprovider/v1beta1/awsproviderstatus_types.go#L40

As for communicating "Failure to deploy a worker": we already trigger alerts any time a machine has no node, regardless of the failure details and regardless of the provider. The details of each failure can then be analysed in the format described above.

Beyond all the above, regardless of the failure details and based on the overall health of the cluster (e.g. 99 out of 100 machines have no node), we might decide on criteria for a semantic that represents a permanent global issue and choose to let the MAO go degraded in that case. But that's a separately scoped discussion.
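For illustration, the "machine has no node" signal can be derived from the Machine resources themselves. The Python sketch below reads a Machine list shaped like the output of `oc get machines -n openshift-machine-api -o json` and reports machines without a node reference. It assumes the `status.nodeRef`, `status.phase`, and `status.errorMessage` fields of the Machine status; verify these against the Machine API version in use. The function name and all sample values are invented.

```python
def machines_without_nodes(machine_list):
    """Return name, phase, and error for each Machine that has no node."""
    missing = []
    for machine in machine_list.get("items", []):
        status = machine.get("status", {})
        if status.get("nodeRef"):
            continue  # This machine is backed by a node.
        missing.append({
            "name": machine["metadata"]["name"],
            "phase": status.get("phase", "unknown"),
            "error": status.get("errorMessage", ""),
        })
    return missing


# Invented sample data shaped like a Machine list.
SAMPLE_MACHINES = {
    "items": [
        {"metadata": {"name": "example-worker-0"},
         "status": {"phase": "Running",
                    "nodeRef": {"kind": "Node", "name": "worker-0"}}},
        {"metadata": {"name": "example-worker-1"},
         "status": {"phase": "Provisioning"}},
        {"metadata": {"name": "example-worker-2"},
         "status": {"phase": "Failed",
                    "errorMessage": "example: host failed to provision"}},
    ]
}

for entry in machines_without_nodes(SAMPLE_MACHINES):
    print(f"{entry['name']}: phase={entry['phase']} error={entry['error']!r}")
```

The length of this list compared to the total number of machines is also the input an aggregated "permanent global issue" criterion would need.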

> We already trigger alerts any time a machine has no node regardless of the failure details and regardless the provider.

These don't show up in the installer output, and as far as I know they do try to capture alerts.

If you've ever done an install and ended up with a non-viable cluster due to insufficient workers, the UX is unacceptable. You get a report of a dozen failing operators and absolutely no indication it's because you don't have enough workers. @sdodson previously mentioned maybe we could do something in the installer about it (#328 (comment)), which may help the problem I guess, but doesn't feel like the right solution to me as machine-api-operator, being the top-level operator for dealing with machines, should be signaling clearly about the problem.

> These don't show up in the installer output, and as far as I know they do try to capture alerts.

That would then be a very specific issue: "Installer output not capturing some existing alerts as expected".

> ended up with a non-viable cluster due to insufficient workers,

I agree. That scenario should be covered by my last statement in the previous comment.

openshift-bot · 2020-10-28T07:24:11Z

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-bot · 2020-11-27T09:17:55Z

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

openshift-bot · 2020-12-27T11:07:46Z

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

openshift-ci-robot · 2020-12-27T11:08:00Z

@openshift-bot: Closed this PR.

In response to this:

> Rotten issues close after 30d of inactivity.
>
> Reopen the issue by commenting /reopen.
> Mark the issue as fresh by commenting /remove-lifecycle rotten.
> Exclude this issue from closing again by commenting /lifecycle frozen.
>
> /close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

baremetal: improve debuggability of ipi deployments

a3bf88f

openshift-ci-robot requested review from jwforres and kbsingh May 15, 2020 17:22

abhinavdahiya reviewed May 15, 2020

jparrill reviewed May 18, 2020

russellb reviewed May 18, 2020

stbenjam and others added 4 commits May 19, 2020 10:09

Update enhancements/baremetal/debuggability-of-baremetal-ipi-deployment.md

45ad3cd

Co-authored-by: Russell Bryant <russell@russellbryant.net>

Apply suggestions from code review

0b73120

Co-authored-by: Russell Bryant <russell@russellbryant.net>

Add more see-also references, and other clean-ups

8b05817

Clarify worker deployments should make operator degraded

1440b6f

sadasu reviewed May 19, 2020

Address comments

545d738

- Open questions seem to be resolved - Rename bootstrap -> bootstrap host

wking mentioned this pull request May 29, 2020

RFE: Automatically analyze gathered bootstrap logs? openshift/installer#2569

Open

stbenjam mentioned this pull request Jun 4, 2020

Bug 1843314: baremetal: bump ironic timeout to 3600 seconds openshift/installer#3721

Merged

stbenjam mentioned this pull request Aug 24, 2020

Bug 1816904: Ensure the Installer checks for the correct number of compute replicas before exiting openshift/installer#4071

Closed

openshift-ci-robot added the lifecycle/stale label Oct 28, 2020

openshift-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Nov 27, 2020

openshift-ci-robot closed this Dec 27, 2020
