Skip to content
This repository has been archived by the owner on Jun 29, 2022. It is now read-only.

EM: Add restart on-failure for metadata service #1362

Merged
merged 4 commits into from
Feb 12, 2021

Conversation

surajssd
Copy link
Member

@surajssd surajssd commented Feb 4, 2021

This PR adds Restart=on-failure and RestartSec=5s to the metadata service.

Fixes #1298


The other services which have Type=oneshot are following: wait-for-dns, bootkube, delete-node, create-etcd-config, persist-data-raid.

Most of them are the type of services which we want them to fail early and the user know about it, instead of them endlessly trying and someone else doing a time out on them.

@knrt10
Copy link
Member

knrt10 commented Feb 4, 2021

Seeing this PR, I think we should slowly start changing Packet -> Equinx Metal

@invidian
Copy link
Member

invidian commented Feb 5, 2021

Seeing this PR, I think we should slowly start changing Packet -> Equinx Metal

We have #1060 :)

Copy link
Member

@invidian invidian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The other services which have Type=oneshot are following: wait-for-dns, bootkube, delete-node, create-etcd-config, persist-data-raid.

Most of them are the type of services which we want them to fail early and the user know about it, instead of them endlessly trying and someone else doing a time out on them.

If there is a DNS outage, I think we should retry wait-for-dns.service, as it will block starting kubelet.

delete-node.service we should also retry IMO, so the node does not go away unregistered. As the pods will stay assigned for long on this node and won't be re-scheduled.

2 points above also applies to other platforms, not only Packet.

This commit adds `Restart=on-failure` and `RestartSec=10s` to the
metadata service.

Signed-off-by: Suraj Deshmukh <suraj@kinvolk.io>
's | Flatcar | Flatcar Container Linux | g'

Signed-off-by: Suraj Deshmukh <suraj@kinvolk.io>
This commit adds `Restart=on-failure` and `RestartSec=5s` to the
wait-for-dns service on all platforms.

Signed-off-by: Suraj Deshmukh <suraj@kinvolk.io>
This commit adds `Restart=on-failure` and `RestartSec=5s` to the
delete-node service on all platforms.

Signed-off-by: Suraj Deshmukh <suraj@kinvolk.io>
@surajssd
Copy link
Member Author

surajssd commented Feb 8, 2021

If there is a DNS outage, I think we should retry wait-for-dns.service, as it will block starting kubelet.
delete-node.service we should also retry IMO, so the node does not go away unregistered. As the pods will stay assigned for long on this node and won't be re-scheduled.

I don't know what the unforeseen consequences will be for adding retries for wait-for-dns and delete-node, but lets try. I can only see them doing indefinite attempt to reconcile leaving user stranded either at installation or while removing the node.

@surajssd
Copy link
Member Author

surajssd commented Feb 9, 2021

I can wait until #1368 is merged.

@pothos
Copy link
Member

pothos commented Feb 11, 2021

One more thing to consider: The oneshot services will be triggered again when the service that depends on it is restarted. I would add RemainAfterExit=yes everywhere, too (but for the resolv.conf waiter it seems harmless if it's triggered on a kubelet restart).

@invidian
Copy link
Member

Good point @pothos.

I can wait until #1368 is merged.

This is now merged.

@surajssd surajssd merged commit 0b0c39c into master Feb 12, 2021
@surajssd surajssd deleted the surajssd/restart-on-failure branch February 12, 2021 14:32
@invidian
Copy link
Member

@surajssd did you see @pothos comment before merging? Did you verify what he suggests?

@surajssd
Copy link
Member Author

@surajssd did you see @pothos comment before merging? Did you verify what he suggests?

Yep, I thought it is to be done in a separate PR. This is now merged. kinda hinted me towards it is good to merge this one.

@surajssd
Copy link
Member Author

surajssd commented Feb 12, 2021

Creating a new issue for what Kai has suggested.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add Restart=on-failure for maintained systemd units to make them more robust
4 participants