Nodes 'Marking Degraded' due to old/absent machineconfig #2010

Closed
thomasmeeus opened this issue Aug 19, 2020 · 14 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@thomasmeeus

Description

I'm seeing this error in the machine-config-daemon on one of our nodes. The node is unable to apply any new machineconfig and keeps reverting to this state:

I0819 11:46:52.788904    2602 daemon.go:786] Current config: rendered-master-d98ba7910d2cd8075b71dabc66795966
I0819 11:46:52.788924    2602 daemon.go:787] Desired config: rendered-master-8f54a0f3d11e506499755e2668265a51
E0819 11:46:52.788940    2602 writer.go:135] Marking Degraded due to: machineconfig.machineconfiguration.openshift.io "rendered-master-b9b68ece3045f2312d3d8c77bd520822" not found

The referenced machineconfig (rendered-master-b9b68ece3045f2312d3d8c77bd520822) no longer exists in the cluster (we deleted it while trying to solve another issue).

oc get machineconfig
NAME                                                        GENERATEDBYCONTROLLER                      IGNITIONVERSION   AGE
00-master                                                   4c5fcbea0bb6694e4192a8dd81ffa471f8731ceb   2.2.0             5d2h
00-worker                                                   4c5fcbea0bb6694e4192a8dd81ffa471f8731ceb   2.2.0             5d2h
01-master-container-runtime                                 4c5fcbea0bb6694e4192a8dd81ffa471f8731ceb   2.2.0             5d2h
01-master-kubelet                                           4c5fcbea0bb6694e4192a8dd81ffa471f8731ceb   2.2.0             5d2h
01-worker-container-runtime                                 4c5fcbea0bb6694e4192a8dd81ffa471f8731ceb   2.2.0             5d2h
01-worker-kubelet                                           4c5fcbea0bb6694e4192a8dd81ffa471f8731ceb   2.2.0             5d2h
05-test                                                                                                2.2.0             19m
42-iscsid                                                                                              2.2.0             20h
42-worker-iscsid                                                                                       2.2.0             23h
50-master-chrony                                                                                       2.2.0             5d2h
50-worker-chrony                                                                                       2.2.0             5d2h
60-master-resolv.conf                                                                                  2.2.0             5d2h
60-worker-resolv.conf                                                                                  2.2.0             5d2h
99-master-disable-mitigations                                                                          3.1.0             5d2h
99-master-fad800b4-5918-4089-b23d-ae99a1bbb1ce-registries   4c5fcbea0bb6694e4192a8dd81ffa471f8731ceb   2.2.0             5d2h
99-master-ssh                                                                                          3.1.0             5d2h
99-worker-846a7243-49fb-41d2-9f28-686feb9d7bc2-registries   4c5fcbea0bb6694e4192a8dd81ffa471f8731ceb   2.2.0             5d2h
99-worker-disable-mitigations                                                                          3.1.0             5d2h
99-worker-iscsi                                                                                        2.2.0             5d2h
99-worker-ssh                                                                                          3.1.0             5d2h
rendered-master-00af66d3a45843ea00df350dbbb71d6e            4c5fcbea0bb6694e4192a8dd81ffa471f8731ceb   2.2.0             4d23h
rendered-master-101a915aee509d910a00dbda852f069e            4c5fcbea0bb6694e4192a8dd81ffa471f8731ceb   2.2.0             20h
rendered-master-22c71a5725b042d19122fe6d8a969196            4c5fcbea0bb6694e4192a8dd81ffa471f8731ceb   2.2.0             23h
rendered-master-24c979cf07f42dd52c8eb217841a0466            4c5fcbea0bb6694e4192a8dd81ffa471f8731ceb   2.2.0             164m
rendered-master-334f8bbc728c029ce7b417e94a2f35f8            4c5fcbea0bb6694e4192a8dd81ffa471f8731ceb   2.2.0             19m
rendered-master-3b059d98289c811f88f0f36ad0af054a            4c5fcbea0bb6694e4192a8dd81ffa471f8731ceb   2.2.0             172m
rendered-master-4f31b78dfc346336dcc740aaa7aed13d            eaa8be79ec7130d3082afafd30dc5ead1e7d54e7   2.2.0             5d
rendered-master-52fa3518f988a2241b439667aa6e053d            4c5fcbea0bb6694e4192a8dd81ffa471f8731ceb   2.2.0             20h
rendered-master-571df0a984e9c6fa4f01b308e39e5238            eaa8be79ec7130d3082afafd30dc5ead1e7d54e7   2.2.0             5d2h
rendered-master-6e4b2e5f5aa3c4cb7232f878d2c53abf            4c5fcbea0bb6694e4192a8dd81ffa471f8731ceb   2.2.0             22h
rendered-master-7c6cc678a4c136bbabf080d3166d3787            4c5fcbea0bb6694e4192a8dd81ffa471f8731ceb   2.2.0             134m
rendered-master-7fa420c0eb9b7db4e5612391a73b4385            4c5fcbea0bb6694e4192a8dd81ffa471f8731ceb   2.2.0             130m
rendered-master-8f54a0f3d11e506499755e2668265a51            4c5fcbea0bb6694e4192a8dd81ffa471f8731ceb   2.2.0             18s
rendered-master-970b8a67d7dec6aa4c6d03b8bb75d00d            eaa8be79ec7130d3082afafd30dc5ead1e7d54e7   2.2.0             5d1h
rendered-master-9bf82c7059025ee8095d0e76afedf45b            eaa8be79ec7130d3082afafd30dc5ead1e7d54e7   2.2.0             5d2h
rendered-master-d98ba7910d2cd8075b71dabc66795966            4c5fcbea0bb6694e4192a8dd81ffa471f8731ceb   2.2.0             118m
rendered-master-f1222397c6ac67cf2b0e5894b36544c5            4c5fcbea0bb6694e4192a8dd81ffa471f8731ceb   2.2.0             22h
rendered-worker-1263aacfa0fcca2bb5b79144525c8e0b            4c5fcbea0bb6694e4192a8dd81ffa471f8731ceb   2.2.0             23h
rendered-worker-1de8f1a7aab14e0c4074fee72a749608            eaa8be79ec7130d3082afafd30dc5ead1e7d54e7   2.2.0             5d1h
rendered-worker-404f5a5361ff56afda0c7bb94c12ea68            4c5fcbea0bb6694e4192a8dd81ffa471f8731ceb   2.2.0             20h
rendered-worker-4a29907b5a4c35f5276d0dc388116a0e            eaa8be79ec7130d3082afafd30dc5ead1e7d54e7   2.2.0             5d
rendered-worker-5bf2fce87bc129999e6a6f8ba21d8f05            eaa8be79ec7130d3082afafd30dc5ead1e7d54e7   2.2.0             5d2h
rendered-worker-648be372bdec0860faaf1d41df5d78c7            4c5fcbea0bb6694e4192a8dd81ffa471f8731ceb   2.2.0             4d23h
rendered-worker-afd1cc6a69af2d071028140a80f5589a            eaa8be79ec7130d3082afafd30dc5ead1e7d54e7   2.2.0             5d2h
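
For reference, the node-side view of this (which rendered config the node has, which it wants, and why it is degraded) can be read from the MCO annotations on the Node object; a sketch, with a placeholder node name:

NODE=master-0   # placeholder; substitute an affected node

# The machine-config-daemon records its view of the node in annotations such as
# machineconfiguration.openshift.io/currentConfig, desiredConfig, state and reason.
oc get node "$NODE" -o yaml | grep 'machineconfiguration\.openshift\.io/'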

Steps to reproduce the issue:
I think:

  1. Make a mistake in a machineconfig (we tried to write a file to a read-only filesystem; a sketch of such a config follows below)
  2. Try to recover from the mistake by patching the machineconfig with a fix and deleting the old machineconfig
  3. Maybe we acted too fast and removed some things while the operator was still busy.
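
For context, a minimal sketch of the kind of machineconfig that can trigger this; the name, path and file contents are placeholders for illustration, not our exact config. Writing into a read-only location on RHCOS (for example under /usr) is enough to make the MCD mark the node degraded:

# Illustrative only -- name and path are placeholders.
cat <<'EOF' | oc apply -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 05-test
  labels:
    machineconfiguration.openshift.io/role: master
spec:
  config:
    ignition:
      version: 2.2.0
    storage:
      files:
      - path: /usr/test/hello.txt   # /usr is read-only on RHCOS
        filesystem: root
        mode: 420
        contents:
          source: data:text/plain;charset=utf-8;base64,aGVsbG8K
EOF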

Describe the results you received:
Nodes are unable to fetch the new machineconfig. Every action we take results in the same state described above.

Describe the results you expected:
Some way to "remove" the faulty machineconfig (rendered-master-b9b68ece3045f2312d3d8c77bd520822) from the cache, or from wherever it is stored (etcd)?

Additional information you deem important (e.g. issue happens only occasionally):
Yesterday we encountered the same issue (again by trying to write a file to a read-only filesystem). That time we were able to recover by fixing the machineconfig and deleting the old (faulty) one.

Output of oc adm release info --commits | grep machine-config-operator:

oc adm release info --commits | grep machine-config-operator
  machine-config-operator                        https://github.com/openshift/machine-config-operator                        4c5fcbea0bb6694e4192a8dd81ffa471f8731ceb

Additional environment details (platform, options, etc.):

  • Version: 4.5.0-0.okd-2020-08-12-020541
  • We already tried this: https://access.redhat.com/solutions/4970731, without success (or we're doing something wrong).
  • We also restarted various machine-config pods & rebooted the nodes one by one, without success.
@cgwalters
Member

cgwalters commented Sep 22, 2020

EDIT: nevermind, moving the "day 1" version to #2114

@cgwalters
Member

Unfortunately there's no easy way to recover from this right now. Basically, you should currently never delete a rendered-<config> while it's being referenced by the operator.

To roll back, the correct thing is to delete the custom machineconfig you injected - the MCO will then retarget the pool to the previous configuration.
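
Concretely, that would look something like the following (05-test is used only as an example of an injected custom config; substitute whatever MC you actually added):

# Delete the custom MachineConfig that was injected (name is an example only);
# the controller renders a new config from the remaining MCs and the pool retargets to it.
oc delete machineconfig 05-test

# Watch the pool converge on the new rendered config.
oc get machineconfigpool master -w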

@yuqi-zhang
Contributor

To add a couple of things:

The referenced missing machineconfig (rendered-master-b9b68ece3045f2312d3d8c77bd520822) does not exist anymore in
the cluster (we deleted it, trying to solve another issue).

Deleting a rendered machineconfig shouldn't really be necessary to fix a problem. If a bad rendered config gets generated, there are ways to recover the cluster, and the bad config will never be used again. If you really want to delete it, you must make sure nothing is referencing it. You can triple-check by (example commands after this list):

  1. making sure the pools are finished (oc get mcp)
  2. making sure there are no references to it in the pool (oc describe mcp/master|worker) and no references to it on the node (oc describe node/nodename -> check the currentConfig and desiredConfig annotations). There are other locations that could reference it, but if the above are complete it should not show up anywhere else.
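
A sketch of those checks as commands (the node name is a placeholder):

# 1. Pools should show UPDATED=True and UPDATING/DEGRADED=False.
oc get mcp

# 2. The pool's configuration should only reference rendered configs that still exist.
oc describe mcp master | grep -i -A 3 'configuration'
oc describe mcp worker | grep -i -A 3 'configuration'

# 3. No node's current/desired config annotation should point at the config you want to delete.
oc describe node <nodename> | grep machineconfiguration.openshift.io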

Somehow a way to "remove" the faulty machineconfig (rendered-master-b9b68ece3045f2312d3d8c77bd520822) from the cache or wherever it is stored (etcd)?

We don't have garbage collection today, but maybe we will add it at some point. For now it's fine, because the rendered config shouldn't be used after you've deleted the bad machineconfig that generated it.

We already tried this: https://access.redhat.com/solutions/4970731, without succes (or we're doing something wrong)

The steps there might work, but it's somewhat case-dependent. The MCD should tell you what the current error is when it comes to a rendered-MC issue like this. Recovery should be possible with a mix of: manual node annotation editing, forcing an update by skipping validation, and removing references to the bad config on the node, the pool, and the journal (most likely by flushing it, if there's a pending state). I'd advise caution unless you know for sure what the issue is and how to recover.
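
For what it's worth, a rough sketch of what that can look like, assuming the node only needs to be pointed back at a rendered config that still exists. The annotation keys are the standard MCO ones; the force-file path is what recent MCD versions watch for, so verify it against your release and treat all of this with caution:

NODE=master-0                                          # placeholder
GOOD=rendered-master-d98ba7910d2cd8075b71dabc66795966  # a rendered config that still exists

# Point the node's MCO annotations back at the existing rendered config.
oc patch node "$NODE" --type merge -p \
  "{\"metadata\":{\"annotations\":{\"machineconfiguration.openshift.io/currentConfig\":\"$GOOD\",\"machineconfiguration.openshift.io/desiredConfig\":\"$GOOD\"}}}"

# If the MCD has pending on-disk state, the force file makes it skip validation
# and re-sync on its next pass.
oc debug node/"$NODE" -- chroot /host touch /run/machine-config-daemon-force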

@crawford
Contributor

If the Machine Config Operator is unable to detect that a rendered config has been removed, regardless of reason, and regenerate it, then that would be a bug. All operators should be in a constant reconciliation loop and willing to take as much action as necessary to realize the specified config. This clearly looks like a bug in MCC. Am I overlooking something?

@yuqi-zhang
Contributor

The MCC would not necessarily have the insight to regenerate a rendered-MC. It's there to render the complete state of the current set of MCs and generate a rendered-config based on that.

I'm pretty sure that if you delete the LATEST rendered config, it would attempt to regenerate it once it realizes there was a change to the MCs. (I'll have to double-check this.)

Bugs like this happen when a user creates a bad MC (which generates a bad rendered-MC), deletes that bad MC, and then deletes that bad rendered-MC before the MCO can properly reconcile. The MCO can no longer regenerate that rendered MC because it no longer has the bad MC to generate it from. However, a node may still be referencing it via the desiredConfig annotation, for example.

@cgwalters
Member

We could install a finalizer to prevent deletion of any rendered configs which are referenced by a node object.
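
For illustration only, the finalizer name below is hypothetical (nothing like it exists in the MCO today). With such a finalizer set by the controller on referenced rendered configs, a delete would not complete until the controller confirmed no node references the config and removed the finalizer:

# Hypothetical finalizer name, shown only to illustrate the mechanism.
oc patch machineconfig rendered-master-8f54a0f3d11e506499755e2668265a51 --type merge \
  -p '{"metadata":{"finalizers":["machineconfiguration.openshift.io/referenced-by-node"]}}'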

@crawford
Contributor

@yuqi-zhang ah! Yes, I was overlooking that case. Thanks for the explanation. I like @cgwalters' suggestion to use finalizers.

@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot openshift-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 24, 2020
@openshift-bot
Contributor

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci-robot openshift-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 23, 2021
@yuqi-zhang
Contributor

/remove-lifecycle rotten

still relevant in the future (probably as an overall rework of some sort eventually)

@openshift-ci-robot openshift-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jan 25, 2021
@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot openshift-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 25, 2021
@openshift-bot
Contributor

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 25, 2021
@openshift-bot
Contributor

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci openshift-ci bot closed this as completed Jun 25, 2021
@openshift-ci
Contributor

openshift-ci bot commented Jun 25, 2021

@openshift-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
