systemd deadlock with socket activation #9

Closed
geofft opened this issue Mar 8, 2021 · 0 comments · Fixed by #10

geofft commented Mar 8, 2021

In internal testing, we've noticed a deadlock on machines that have nsncd installed, due to limitations in how systemd socket activation works.

The basic setup is we have a custom systemd password agent (which is path-activated as recommended) that makes NSS lookups. That agent is called by systemd-cryptsetup to get a password to unlock the /home partition. We can see from debug-level systemd logging that, when our agent makes an NSS lookup, systemd sees traffic on the NSCD socket and the nsncd.service/start job gets enqueued, but it doesn't actually execute that job and start nsncd until the existing disk-decryption job has timed out after a few minutes of waiting for a password.

This seems very similar to the following bug reports:

I don't totally understand the rules of what systemd is doing, but I think the general rule is that only one job gets run at a time, and that job has to finish before the next one starts. This works out okay for the normal case of socket activation, because traffic only happens after the calling service has started, and so systemd is ready to run a new job promptly. But it seems like certain types of services (NSS, in particular) are called from contexts where socket activation simply should not be used.

(In some testing, it seems like it's fine to socket-activate from a Type=notify service before notification has happened... but I don't yet understand what makes this case different.)

In the interest of robustness, it seems like it's best to just drop socket activation entirely, and also not bother with having nsncd notify readiness to systemd. nsncd should just open the socket at startup and listen on it. The NSCD protocol works fine when the socket doesn't exist yet (it just falls back to direct NSS queries), and everything run during the boot process ought to use the host libc and therefore not really require nsncd.
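To make the proposed behavior concrete, here is a minimal sketch of both sides: a daemon that binds and listens on its Unix socket itself at startup, and a client that falls back to a direct lookup when the socket is missing. It is a toy model, not nsncd's code or glibc's client logic: `start_listener`, `lookup`, and the `direct_lookup` callback are hypothetical names, the return strings are placeholders, and the demo uses a temporary path rather than the real NSCD socket location.

```python
import os
import socket
import tempfile


def start_listener(path):
    """Bind and listen on a Unix socket at startup (no socket activation)."""
    if os.path.exists(path):
        os.unlink(path)  # clear a stale socket file left by a previous run
    server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    server.bind(path)
    server.listen(16)
    return server


def lookup(path, direct_lookup):
    """Use the daemon if its socket is reachable, else do a direct lookup.

    Hypothetical stand-in for the client-side fallback: a missing socket or
    a socket with no listener simply means "do the query ourselves".
    """
    client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        client.connect(path)
    except (FileNotFoundError, ConnectionRefusedError):
        return direct_lookup()
    else:
        return "answered via daemon socket"
    finally:
        client.close()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        sock_path = os.path.join(tmp, "socket")
        # Before the daemon has started: the client falls back.
        print(lookup(sock_path, lambda: "direct NSS lookup"))
        # After the daemon has bound its socket: the client uses it.
        server = start_listener(sock_path)
        print(lookup(sock_path, lambda: "direct NSS lookup"))
        server.close()
```

The point of the sketch is that nothing on the client side ever waits for the daemon, so a lookup made early in boot cannot queue a start job behind the job that is making the lookup.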

I think it's also best to avoid readiness notification, because I don't totally understand whether the time between starting nsncd and getting the readiness notification counts as a running job and what that means for the job queue. We should just make nsncd a Type=simple service. This might cause race conditions with services that require nsncd (i.e., services with a foreign libc) that try to say Requires=nsncd.service, because they don't know when the NSCD socket has actually been created. If that causes a problem in practice for anyone, we should come back and add readiness notification and test it a bit more.
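The race described above can be illustrated with a small sketch. With a Type=simple service, a dependent process only knows that nsncd was started, not that the socket exists yet, so one possible stopgap is to poll for the socket file before relying on it. This is an assumption for illustration only, not something this issue proposes: `wait_for_socket` and its `timeout` and `interval` parameters are hypothetical.

```python
import os
import stat
import time


def wait_for_socket(path, timeout=5.0, interval=0.05):
    """Poll until `path` exists and is a Unix socket, or until `timeout` expires.

    Returns True if the socket appeared in time, False otherwise.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            if stat.S_ISSOCK(os.stat(path).st_mode):
                return True
        except FileNotFoundError:
            pass  # not created yet; keep waiting
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A caller would pass the socket path its libc expects and decide for itself what to do if the wait times out. Note that the socket file existing only shows that the daemon has called bind, which is a weaker guarantee than a readiness notification would give.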

geofft added a commit that referenced this issue Mar 8, 2021
Fixes #9.

Co-authored-by: Geoffrey Thomas <geofft@twosigma.com>
@geofft geofft closed this as completed in #10 Mar 9, 2021