Support watchdog coverage for first phase of reboot #71

Open
tonyespy opened this issue Jul 7, 2022 · 2 comments

tonyespy commented Jul 7, 2022

We currently support two system configuration options (since snapd 2.34) that control the behavior of systems with hardware watchdog timers:

  • watchdog.runtime-timeout
  • watchdog.shutdown-timeout

The documentation for shutdown-timeout says:

The watchdog shutdown timeout is an interval to permit a clean reboot of the system. If the system fails to reboot within this interval, the watchdog will forcibly restart the system to protect against failed or hanging reboots.

This is slightly misleading: the timeout is used to reset the hardware watchdog on each iteration of the main loop within systemd-shutdown, so in practice the reboot could take much longer than this timeout.

Note that the shutdown-timeout applies only to the second phase of a reboot, after all regular services are terminated and the system and service manager process has been replaced by the systemd-shutdown binary.
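As a rough sketch of the two phases, the fragment below shows the systemd manager settings these options presumably correspond to. The mapping of watchdog.runtime-timeout and watchdog.shutdown-timeout onto RuntimeWatchdogSec= and ShutdownWatchdogSec= is an assumption here, and the file path and values are hypothetical placeholders:

```ini
# Hypothetical drop-in path: /etc/systemd/system.conf.d/watchdog.conf
[Manager]
# Applies while the system and service manager (PID 1) is still running,
# which includes the first phase of shutdown.
RuntimeWatchdogSec=40s
# Applies in the second phase, once systemd-shutdown has replaced the manager.
ShutdownWatchdogSec=10min
```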

This means that reboot hangs caused by misbehaving and/or unkillable processes are not handled by this timeout. The man page for systemd-system.conf is a bit confusing, as it says:

During the first phase of the shutdown operation the system and service manager remains running and hence RuntimeWatchdogSec= is still honoured.

...but then it says:

In order to define a timeout on this first phase of system shutdown, configure JobTimeoutSec= and JobTimeoutAction= in the [Unit] section of the shutdown.target unit.

So it's not 100% clear to me whether we need to additionally modify the shutdown.target unit.
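For illustration, a minimal sketch of what such a modification to shutdown.target could look like, following the man page text quoted above. The drop-in path, the 10 minute value, and the choice of reboot-force are assumptions, not a tested recommendation:

```ini
# Hypothetical drop-in path: /etc/systemd/system/shutdown.target.d/job-timeout.conf
[Unit]
# Time limit on the shutdown.target job, i.e. on the first phase of shutdown.
JobTimeoutSec=10min
# What to do if the job does not finish in time; reboot-force forcibly
# terminates the remaining processes and reboots.
JobTimeoutAction=reboot-force
```

Whether this is needed in addition to RuntimeWatchdogSec= remains the open question of this issue: the watchdog only fires if the manager itself stops responding, whereas a job timeout would also cover a manager that is alive but waiting on a hung service.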

Related to this issue is the matter of whether we actually have any test cases to validate watchdog behavior during shutdown.


tonyespy commented Jul 7, 2022

Also found this reply from Lennart which adds some additional context:

https://systemd-devel.freedesktop.narkive.com/sF9dAPsy/systemd-issues-related-to-watchdog


tonyespy commented Jul 7, 2022

And this post, which clarifies how the shutdown-timeout is actually used:

https://utcc.utoronto.ca/~cks/space/blog/linux/SystemdShutdownWatchdog
