EM: Add restart on-failure for metadata service #1362

surajssd · 2021-02-04T12:39:36Z

This PR adds Restart=on-failure and RestartSec=5s to the metadata service.

The other services that have Type=oneshot are the following: wait-for-dns, bootkube, delete-node, create-etcd-config, persist-data-raid.

Most of them are the kind of service that we want to fail early, so that the user knows about it, rather than having them retry endlessly until something else times out on them.
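
For readers unfamiliar with these settings, here is a minimal sketch of how they might look in a oneshot unit. The description and the `ExecStart` path are hypothetical placeholders, not the actual metadata service definition from this repository, and, as far as I know, `Restart=on-failure` on a `Type=oneshot` service needs systemd 244 or newer.

```ini
# Illustrative unit only: the description and ExecStart path are assumptions.
[Unit]
Description=Example metadata fetcher (hypothetical)

[Service]
Type=oneshot
ExecStart=/opt/example/fetch-metadata.sh
# Re-run the command if it exits with a non-zero status...
Restart=on-failure
# ...after waiting this long between attempts.
RestartSec=5s
```

With these two settings a transient failure is retried after a short delay instead of leaving the unit in a failed state on the first error.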

knrt10 · 2021-02-04T12:51:45Z

Seeing this PR, I think we should slowly start changing Packet -> Equinix Metal

invidian · 2021-02-05T13:19:20Z

> Seeing this PR, I think we should slowly start changing Packet -> Equinix Metal

We have #1060 :)

invidian

> The other services that have Type=oneshot are the following: wait-for-dns, bootkube, delete-node, create-etcd-config, persist-data-raid.
>
> Most of them are the kind of service that we want to fail early, so that the user knows about it, rather than having them retry endlessly until something else times out on them.

If there is a DNS outage, I think we should retry wait-for-dns.service, as it will block starting kubelet.

IMO we should also retry delete-node.service, so the node does not go away without being unregistered. Otherwise the pods will stay assigned to this node for a long time and won't be rescheduled.

The two points above also apply to other platforms, not only Packet.

surajssd · 2021-02-08T10:11:38Z

> If there is a DNS outage, I think we should retry wait-for-dns.service, as it will block starting kubelet.
>
> IMO we should also retry delete-node.service, so the node does not go away without being unregistered. Otherwise the pods will stay assigned to this node for a long time and won't be rescheduled.

I don't know what the unforeseen consequences of adding retries to wait-for-dns and delete-node will be, but let's try. The only risk I can see is that they retry indefinitely, leaving the user stranded either at installation or while removing the node.

surajssd · 2021-02-09T09:56:16Z

I can wait until #1368 is merged.

pothos · 2021-02-11T20:59:09Z

One more thing to consider: the oneshot services will be triggered again when a service that depends on them is restarted. I would add RemainAfterExit=yes everywhere, too (but for the resolv.conf waiter it seems harmless if it's triggered on a kubelet restart).
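
A minimal sketch of that suggestion, assuming a hypothetical waiter script (the `ExecStart` path is a placeholder, not the real wait-for-dns unit):

```ini
# Illustrative fragment only: the ExecStart path is an assumption.
[Service]
Type=oneshot
# Keep the unit in the "active (exited)" state after the command succeeds,
# so restarting a dependent unit does not start this one again.
RemainAfterExit=yes
ExecStart=/opt/example/wait-for-dns.sh
Restart=on-failure
RestartSec=5s
```

Without `RemainAfterExit=yes`, a oneshot unit returns to inactive once its command finishes, so a later start request from a dependent unit runs the command again.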

invidian · 2021-02-12T13:03:38Z

Good point @pothos.

> I can wait until #1368 is merged.

This is now merged.

invidian · 2021-02-12T14:33:53Z

@surajssd did you see @pothos's comment before merging? Did you verify what he suggests?

surajssd · 2021-02-12T14:35:49Z

> @surajssd did you see @pothos's comment before merging? Did you verify what he suggests?

Yep, I thought it was to be done in a separate PR. The "This is now merged." comment kind of hinted to me that it was good to merge this one.

surajssd · 2021-02-12T14:36:18Z

Creating a new issue for what Kai has suggested.

surajssd requested review from iaguis and invidian February 4, 2021 12:39

invidian suggested changes Feb 5, 2021

assets/terraform-modules/packet/flatcar-linux/kubernetes/cl/controller.yaml.tmpl (outdated)

surajssd added 4 commits February 8, 2021 15:40

EM: Add restart on-failure for metadata service

e4d3aea

This commit adds `Restart=on-failure` and `RestartSec=10s` to the metadata service. Signed-off-by: Suraj Deshmukh <suraj@kinvolk.io>

EM: Update name to Flatcar Container Linux

1d60eb1

's | Flatcar | Flatcar Container Linux | g' Signed-off-by: Suraj Deshmukh <suraj@kinvolk.io>

wait-for-dns: Add restart on-failure and restart sec

131ee9a

This commit adds `Restart=on-failure` and `RestartSec=5s` to the wait-for-dns service on all platforms. Signed-off-by: Suraj Deshmukh <suraj@kinvolk.io>

delete-node: Add restart on-failure and restart sec

08d342f

This commit adds `Restart=on-failure` and `RestartSec=5s` to the delete-node service on all platforms. Signed-off-by: Suraj Deshmukh <suraj@kinvolk.io>

surajssd force-pushed the surajssd/restart-on-failure branch from f0813be to 08d342f Compare February 8, 2021 10:11

surajssd requested a review from invidian February 9, 2021 07:36

invidian approved these changes Feb 9, 2021

surajssd merged commit 0b0c39c into master Feb 12, 2021

surajssd deleted the surajssd/restart-on-failure branch February 12, 2021 14:32

surajssd mentioned this pull request Feb 12, 2021

Add RemainAfterExit=yes for oneshot services #1371

Closed
