systemd deadlock with socket activation #9

geofft · 2021-03-08T18:49:08Z

In internal testing, we've noticed a deadlock on machines that have nsncd installed, due to limitations in how systemd socket activation works.

The basic setup is that we have a custom systemd password agent (which is path-activated as recommended) that makes NSS lookups. That agent is called by systemd-cryptsetup to get a password to unlock the /home partition. We can see from debug-level systemd logging that, when our agent makes an NSS lookup, systemd sees traffic on the NSCD socket and enqueues the nsncd.service/start job, but it doesn't actually execute that job and start nsncd until the existing disk-decryption job has timed out after a few minutes of waiting for a password.

This seems very similar to the following bug reports:

https://bugs.freedesktop.org/show_bug.cgi?id=98254 , in which cloud-init tries to socket-activate D-Bus. As it happens, this is also an NSS issue - they're using nss_resolved, which makes a D-Bus query, which activates the D-Bus socket.
https://lists.freedesktop.org/archives/systemd-devel/2015-February/027966.html , in which running systemctl restart (without --no-block) in an if-up.d script deadlocks, because the if-up.d script is called from activating a udev unit.

I don't totally understand the rules of what systemd is doing, but I think the general rule is that only one job gets run at a time, and that job has to finish before the next one starts. This works out okay for the normal case of socket activation, because traffic only happens after the calling service has started, and so systemd is ready to run a new job promptly. But it seems like certain types of services (NSS, in particular) are called from contexts where socket activation simply should not be used.

(In some testing, it seems like it's fine to socket-activate from a Type=notify service before notification has happened... but I don't yet understand what makes this case different.)

In the interest of robustness, it seems like it's best to just drop socket activation entirely, and also not bother with having nsncd notify readiness to systemd. nsncd should just open the socket at startup and listen on it. The NSCD protocol works fine when the socket doesn't exist yet (it just falls back to direct NSS queries), and everything run during the boot process ought to use the host libc and therefore not really require nsncd.
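To make the two halves of that proposal concrete, here is a minimal Python sketch, not nsncd's actual code. It assumes a daemon that binds and listens on its own Unix socket at startup (rather than inheriting one from systemd) and a client that falls back to a direct lookup when the socket is missing or nothing is listening. The function names, the `"via-daemon"` placeholder, and the temporary socket path are all hypothetical; a real client would speak the NSCD protocol after connecting.

```python
import os
import socket
import tempfile


def open_listening_socket(path):
    """Bind and listen on a Unix socket ourselves (no socket activation)."""
    # Remove a stale socket file left over from a previous run, if any.
    try:
        os.unlink(path)
    except FileNotFoundError:
        pass
    server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    server.bind(path)
    server.listen()
    return server


def lookup(path, direct_lookup):
    """Try the daemon socket first; fall back to a direct lookup otherwise."""
    client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        client.connect(path)
    except (FileNotFoundError, ConnectionRefusedError):
        # Socket missing or nobody listening: behave as if no daemon exists.
        return direct_lookup()
    else:
        # Placeholder: the actual protocol exchange is omitted in this sketch.
        return "via-daemon"
    finally:
        client.close()


if __name__ == "__main__":
    # Hypothetical socket path in a temporary directory, for demonstration.
    sock_path = os.path.join(tempfile.mkdtemp(), "sock")
    print(lookup(sock_path, lambda: "direct"))  # daemon not started yet
    server = open_listening_socket(sock_path)
    print(lookup(sock_path, lambda: "direct"))  # daemon is listening
    server.close()
```

The point of the sketch is that a client never blocks on the daemon being started: before the socket exists, the connect attempt fails immediately and the caller takes the direct path.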

I think it's also best to avoid readiness notification, because I don't totally understand whether the time between starting nsncd and getting the readiness notification counts as a running job and what that means for the job queue. We should just make nsncd a Type=simple service. This might cause race conditions with services that require nsncd (i.e., services with a foreign libc) that try to say Requires=nsncd.service, because they don't know when the NSCD socket has actually been created. If that causes a problem in practice for anyone, we should come back and add readiness notification and test it a bit more.
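As a rough illustration of that proposal, a Type=simple unit might look something like the fragment below. This is a sketch under assumptions, not the unit shipped with nsncd: the binary path /usr/sbin/nsncd and the description text are hypothetical, and there is deliberately no accompanying .socket unit.

```ini
# nsncd.service (hypothetical sketch; paths are assumptions)
[Unit]
Description=nsncd, a name service non-caching daemon

[Service]
# Type=simple: systemd treats the service as started as soon as the
# process has been launched, without waiting for a readiness notification.
Type=simple
# Hypothetical install path; nsncd creates and listens on the NSCD
# socket itself instead of receiving it from systemd.
ExecStart=/usr/sbin/nsncd

[Install]
WantedBy=multi-user.target
```

A dependent service could declare Requires=nsncd.service and After=nsncd.service, but with Type=simple that only orders it after the nsncd process has been launched, not after the socket exists, which is the race described above.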



geofft added a commit that referenced this issue Mar 8, 2021

Don't use systemd socket activation becuase startup ordering is weird.

e65e570

Fixes #9. Co-authored-by: Geoffrey Thomas <geofft@twosigma.com>

geofft added a commit that referenced this issue Mar 8, 2021

Don't use systemd socket activation becuase startup ordering is weird.

84a9a1f

Fixes #9. Co-authored-by: Geoffrey Thomas <geofft@twosigma.com>

geofft added a commit that referenced this issue Mar 8, 2021

Don't use systemd socket activation becuase startup ordering is weird.

4bc0c9b

Fixes #9. Co-authored-by: Geoffrey Thomas <geofft@twosigma.com>

geofft mentioned this issue Mar 8, 2021

Don't use systemd socket activation becuase startup ordering is weird. #10

Merged

geofft closed this as completed in #10 Mar 9, 2021

leifwalsh mentioned this issue May 17, 2021

Run as non-root #14

Open

leifwalsh mentioned this issue Oct 10, 2022

use sd-notify to signal readyness #35

Merged
