fix delay on reboot or power off #859
Are we confident that this isn't going to cause the problem that 75125f6 was supposed to fix? In particular, it sounds like shutdown would block when there were containers that needed to exit and the containerd-shim processes weren't receiving appropriate signals. Can you test with running containers (both host-containers and orchestrated containers) and make sure that shutdown does not block?
Per testing output, all the containerd-shim processes are sent Mostly this results in the same behavior as before; with
That's indeed what I tested. The shutdown timeout (and revert) was the original fix. I added the global stop timeout after observing that the
Thanks for verifying!
It wasn't called out in the "Testing done" description, so I wanted to make sure.
packages/systemd/9004-core-add-separate-timeout-for-system-shutdown.patch
If we are still running processes during shutdown, they are likely to be running in containers rather than managed by the host system. They may not expect or respond to SIGTERM, which can delay the restart for up to 90 seconds.

In the case of an update, we expect containers to be drained by the operator before the system is restarted. However, if the system is powered off directly then this may not happen.

With the network down, it is unlikely that processes can complete any useful work apart from syncing data to disk. A lower timeout means we will reboot or power off more quickly, which allows the node or its replacement to come up faster.

Signed-off-by: Ben Cressey <bcressey@amazon.com>
This reverts commit 75125f6. The reason for changing KillMode was to make the system shut down more quickly. However, this is flawed in practice because although the containerd shims are killed more quickly, the processes running inside the containers are not since they are no longer part of the unit's control group. This configuration is also discouraged by upstream, as it means that the containerd service cannot be safely restarted.
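For context, the setting under discussion can be sketched as a unit fragment. This is an illustration only: it assumes the revert restores `KillMode=process`, which is the value the upstream containerd unit file ships with, but the commit message itself does not state the restored value.

```ini
# Sketch of the relevant part of a containerd service unit.
# Assumption: the revert goes back to KillMode=process (not stated in the commit).
[Service]
# On stop or restart, signal only the main containerd process.
# Shims and container processes are left running, so the service
# can be restarted without taking containers down.
KillMode=process

# The reverted setting, for comparison:
# KillMode=mixed
# SIGTERM goes to the main process, then SIGKILL to any processes
# remaining in the unit's control group.
```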
The default timeout for start and stop is 90 seconds, and none of the current host services require this much time. If they did, we could override the setting locally in the service unit.

Many of the processes on the system end up running inside scopes that are dynamically created by the orchestrator agent. By changing the default timeout, we ensure that these processes are stopped quickly during shutdown or restart.

Signed-off-by: Ben Cressey <bcressey@amazon.com>
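The two levers this commit message describes, a lower manager-wide default and a per-service override, can be sketched as configuration fragments. This is a minimal illustration, not the contents of the patch: `DefaultTimeoutStartSec=`, `DefaultTimeoutStopSec=` and `TimeoutStopSec=` are standard systemd options, but the file paths, the service name and all values below are placeholders chosen for the example.

```ini
# /etc/systemd/system.conf.d/10-timeouts.conf  (hypothetical drop-in path)
[Manager]
# Manager-wide defaults, replacing the stock 90 seconds.
# Placeholder values; the commit message does not give the numbers.
DefaultTimeoutStartSec=60s
DefaultTimeoutStopSec=60s

# /etc/systemd/system/example.service.d/10-timeout.conf  (hypothetical service)
[Service]
# Local override for a service that genuinely needs longer to stop.
TimeoutStopSec=120s
```

Units that do not set their own timeout, including dynamically created ones, pick up the manager-wide default, which is why lowering it affects processes the host does not manage directly.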
Issue number: #858

Description of changes:
This reverts the KillMode=mixed change from 75125f6 and replaces it with lower timeouts for starting and stopping services.

Testing done:
Terminated an instance through the API.
New settings are applied:
Relevant console output during shutdown from an instance with running pods and host containers:
Terms of contribution:
By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.