In internal testing, we've noticed a deadlock on machines that have nsncd installed, due to limitations on how systemd socket activation works.
The basic setup is that we have a custom systemd password agent (which is path-activated, as recommended) that makes NSS lookups. That agent is called by systemd-cryptsetup to get a password to unlock the /home partition. We can see from debug-level systemd logging that, when our agent makes an NSS lookup, systemd sees traffic on the NSCD socket and the nsncd.service/start job gets enqueued, but it doesn't actually execute that job and start nsncd until the existing disk-decryption job has timed out after a few minutes of waiting for a password.
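For reference, the socket-activated arrangement looks roughly like the sketch below. The unit contents and binary path are illustrative rather than the actual packaging; the socket path is the one glibc's nscd client code connects to.

```ini
# nsncd.socket (sketch)
[Socket]
# The path glibc's nscd client connects to for passwd/group/hosts lookups.
ListenStream=/var/run/nscd/socket

[Install]
WantedBy=sockets.target

# nsncd.service (sketch; binary path and Type= are assumptions)
[Service]
Type=notify
ExecStart=/usr/sbin/nsncd
```

With this arrangement, the first NSS lookup that touches the socket is what enqueues the nsncd.service/start job described above.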
This seems very similar to the following bug reports:

- https://bugs.freedesktop.org/show_bug.cgi?id=98254 , in which cloud-init tries to socket-activate D-Bus. As it happens, this is also an NSS issue: they're using nss-resolve, which makes a D-Bus query, which activates the D-Bus socket.
- systemctl restart (without --no-block) in an if-up.d script deadlocks, because the if-up.d script is called from an activating udev unit.
I don't totally understand the rules of what systemd is doing, but I think the general rule is that only one job gets run at a time, and that job has to finish before the next one starts. This works out okay for the normal case of socket activation, because traffic only happens after the calling service has started, and so systemd is ready to run a new job promptly. But it seems like certain types of services (NSS, in particular) are called from contexts where socket activation simply should not be used.
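If a shell is reachable while the boot is hung (e.g. on another virtual terminal), one way to sanity-check this theory is to look at the job queue directly; the expectation in the comment below is an assumption based on the behaviour described above, not captured output.

```sh
# List queued and running jobs. In the hung state one would expect the
# disk-decryption job to show as "running" and nsncd.service start as "waiting".
systemctl list-jobs
```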
(In some testing, it seems like it's fine to socket-activate from a Type=notify service before notification has happened... but I don't yet understand what makes this case different.)
In the interest of robustness, it seems like it's best to just drop socket activation entirely, and also not bother with having nsncd notify readiness to systemd. nsncd should just open the socket at startup and listen on it. The NSCD protocol works fine when the socket doesn't exist yet (it just falls back to direct NSS queries), and everything run during the boot process ought to use the host libc and therefore not really require nsncd.
I think it's also best to avoid readiness notification, because I don't totally understand whether the time between starting nsncd and getting the readiness notification counts as a running job and what that means for the job queue. We should just make nsncd a Type=simple service. This might cause race conditions with services that require nsncd (i.e., services with a foreign libc) that try to say Requires=nsncd.service, because they don't know when the NSCD socket has actually been created. If that causes a problem in practice for anyone, we should come back and add readiness notification and test it a bit more.
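Concretely, that proposal amounts to shipping just a plain service unit, something like the sketch below (the Description, binary path, and install target are assumptions for illustration); a dependent service would pair Requires=nsncd.service with After=nsncd.service, with the caveat that this orders only on process start, not on the socket existing.

```ini
# nsncd.service (sketch of the proposed non-socket-activated unit; paths and
# Description are assumptions)
[Unit]
Description=Name service non-caching daemon

[Service]
Type=simple
# With Type=simple the unit counts as started as soon as the process is
# forked; nsncd is expected to create /var/run/nscd/socket shortly after.
ExecStart=/usr/sbin/nsncd

[Install]
WantedBy=multi-user.target

# A dependent service with a foreign libc might add, in a drop-in:
# [Unit]
# Requires=nsncd.service
# After=nsncd.service
# ...but After= only orders on process start here, not on the socket actually
# existing, which is the race described above.
```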