Disable nscd caching #50316
67b7153
to
f667a18
Compare
Supersedes #50042 if we also disable host caching
CC @Mic92. We could also replace it with
I haven't actually tried setting
I think we should first disable negative-time-to-live caches for
This should fix the problem where people cannot reach a site because its hostname was resolved while they had no connectivity.
I think this is worth mentioning in the release notes.
We might as well disable the positive TTL too and say that nscd is not for caching. Whenever something unexplained happens with name resolution, my first step is always to restart nscd, which solves it nine times out of ten.
Hm, what's the advantage of disabling caching of positive results?
Basically, I disabled everything to be 100% sure nothing fishy is happening. The systemd project acknowledged that they are not nscd-aware and that they might look into fixing that, which means there might be more subtle bugs lurking in the presence of caching. I'm not sure whether the positive cache is invalidated when a systemd unit with dynamic users stops. If it isn't, this might cause odd bugs when another systemd service with dynamic users recycles that uid and discovers there is already a name associated with it.
User lookups are basically instantaneous, and the only time we would want to cache them is in the presence of LDAP, to improve performance. But sssd already does its own caching, so we already disable nscd caching when sssd is enabled, both in NixOS and in Red Hat (https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/deployment_guide/usingnscd-sssd). So the only time we seem to have caching is when the passwd database is backed by a file. I think that is a bit overkill and will cause more problems than it solves.
Edit: @edolstra oh, I think I just realised you were talking about caching of hosts, not the other caches.
I've disabled negative caching for hosts for now and left positive caching in. I'm not very familiar with the release process of NixOS. What should a release notes entry look like, and where do I put it?
You can take the last release as an example: https://github.com/NixOS/nixpkgs/blob/master/nixos/doc/manual/release-notes/rl-1809.xml and yours should go here: https://github.com/NixOS/nixpkgs/blob/master/nixos/doc/manual/release-notes/rl-1903.xml
Now that
@arianvp when they are 100% the same, we can drop the sssd one.
 suggested-size group 211
 check-files group yes
 persistent group no
 shared group yes

 enable-cache hosts yes
 positive-time-to-live hosts 600
-negative-time-to-live hosts 5
+negative-time-to-live hosts 0
I wonder if we should at least backport this one, since it can be quite annoying if websites do not load correctly after connecting to a hotspot.
Yes, I’ve had this problem for more than a month now, so the release must have it as well.
We don't gain anything by having it enabled, and disabling it removes one potential source of problems.
<listitem>
<para>
The <literal>nscd</literal> now disables all caching of
<literal>passwd</literal> and <literal>group</litral> databases by
Should be </literal> (the closing tag in group).
nix build -f my-nixpkgs/nixos config.system.build.manual --arg configuration {}
to make sure the manual builds :)
@peterhoeg we gain a DNS cache if it is enabled. I don't see how this would cause problems if it caches valid entries. DNS is designed around those caches and uses TTLs itself to control invalidation.
@edolstra I reverted your revert, as I don't think it fixes anything. I'm 100% sure that nscd already invalidates the cache when |
b4e2fbd
to
aeb4aa1
Compare
After reading the discussion at #42569 perhaps we should only unrevert the commit Eelco did, after we fully disable caching in I think we should merge this as is, and then in a next PR:
<literal>libnss_systemd.so</literal> module which is used by
<literal>systemd</literal> to manage uids and usernames in the presence
of <literal>DynamicUser=</literal> in systemd services.
The was already the default behaviour in presence of
-The was already the default behaviour in presence of
+This was already the default behaviour in presence of
more doc feedback
I have addressed the doc issues. I think this is ready for merge. I have created a follow-up ticket with the rest of the tasks that came up in this discussion: #51911
@@ -4,25 +4,41 @@ paranoia no
debug-level 0

enable-cache passwd yes
Do we want to disable the caches where we set a zero TTL for both positive and negative results? That seems less confusing to me :-)
As Eelco mentioned, this would have a serious performance penalty, so I don't want to change it until #51911 is implemented. I think this is a good compromise until we figure out the resolved business: people do not get failed lookups when switching networks, but still get the performance benefit.
Oh, I misread. See the original commit message for why we can't do this:
Note that we can not just put in /etc/nscd.conf:
enable-cache passwd no
As this will actually cause glibc to _not_ forward the call to nscd
at all, and thus never reach the nss modules. Instead we set
the negative and positive cache ttls to 0 seconds as a workaround.
This way, Glibc will always forward requests to nscd, but results
will never be cached.
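For illustration only, the workaround quoted above can be sketched as a small Python helper that renders the corresponding nscd.conf lines. The helper name `uncached_stanza` is hypothetical; only the three directive names come from the nscd.conf shown in this PR, and the layout of the real file may differ.

```python
def uncached_stanza(database):
    """Render nscd.conf lines for one database (hypothetical helper).

    enable-cache stays "yes" so that glibc keeps forwarding lookups to
    nscd, while both TTLs are 0 so that no result is ever cached.
    """
    return [
        f"enable-cache {database} yes",
        f"positive-time-to-live {database} 0",
        f"negative-time-to-live {database} 0",
    ]


if __name__ == "__main__":
    # passwd and group are the databases named in the release note.
    for db in ("passwd", "group"):
        print("\n".join(uncached_stanza(db)))
```

Setting `enable-cache` to `no` instead would, per the commit message, stop glibc from contacting nscd at all, which is why the sketch never emits that value.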
I added this as a comment to nscd.conf to clarify.
Systemd provides an option for allocating DynamicUsers, which we want to use in NixOS to harden service configuration. However, we discovered that the user wasn't allocated properly for services. After some digging this turned out to be, of course, a cache inconsistency problem.
When a DynamicUser is created, systemd checks beforehand whether the requested user already exists statically. If it does, it bails out. If it doesn't, systemd continues with allocating the user.
However, by checking whether the user exists, nscd will store the fact that the user does not exist in its negative cache. When the service tries to look up which user is associated with its uid (by calling whoami, for example), it will try to consult libnss_systemd.so. However, this lookup will read from the cache and report that the user doesn't exist, and thus that there is no user associated with the uid. It will continue to do so for the lifetime of the cache entry.
If the service doesn't immediately look up its username, this bug is not triggered, as the cache entry will have expired by then. However, if the service is quick enough, it might end up in a situation where it is incorrectly reported that the user doesn't exist.
Preferably, we would not be using nscd at all. But we need to use it because glibc loads the nss modules named in /etc/nsswitch.conf by looking relative to the global LD_LIBRARY_PATH. Because LD_LIBRARY_PATH is not set globally (as that would lead to impurities and ABI issues), glibc will fail to find any nss modules.
Instead, as a hack, we start up nscd with LD_LIBRARY_PATH set for only that service. Glibc will forward all nss lookups to nscd, which will then respect the LD_LIBRARY_PATH and only read from locations specified in the NixOS config. This way we can load nss modules in a pure fashion.
However, I think by accident, we just copied over the default settings of nscd, which actually caches user and group lookups.
We already disable this when sssd is enabled, as it interferes with the correct working of libnss_sss.so, which already does its own caching of LDAP requests. (See https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/deployment_guide/usingnscd-sssd)
Because nscd caching is now also interfering with libnss_systemd.so, and probably also with other nss modules, let's just pre-emptively disable caching for now for all options related to users and groups, but keep it for caching hosts and services lookups.
Note that we can not just put in /etc/nscd.conf:
enable-cache passwd no
As this will actually cause glibc to _not_ forward the call to nscd at all, and thus never reach the nss modules. Instead we set the negative and positive cache ttls to 0 seconds as a workaround. This way, Glibc will always forward requests to nscd, but results will never be cached.
Fixes NixOS#50273
Hopefully fixes NixOS#50290
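The race described in the commit message can be reproduced with a toy model. This is a minimal sketch, not nscd's actual implementation: the `NegativeCache` class, the user name `dynuser`, the uid 61234 and the 20 second TTL are all made up for illustration, and the real lookup chain (a name lookup followed by a uid lookup) is collapsed into a single name-keyed lookup.

```python
class NegativeCache:
    """Toy stand-in for a negative cache: remembers 'not found' answers."""

    def __init__(self, backend, negative_ttl):
        self.backend = backend            # dict name -> uid, stands in for the nss modules
        self.negative_ttl = negative_ttl  # seconds a miss stays cached; 0 disables caching
        self._misses = {}                 # name -> time at which the cached miss expires

    def lookup(self, name, now):
        expires = self._misses.get(name)
        if expires is not None and now < expires:
            return None                   # stale "no such user" served from the cache
        uid = self.backend.get(name)
        if uid is None and self.negative_ttl > 0:
            self._misses[name] = now + self.negative_ttl
        return uid


def simulate(negative_ttl):
    users = {}
    cache = NegativeCache(users, negative_ttl)
    # 1. systemd checks whether the user already exists statically: a miss.
    before = cache.lookup("dynuser", now=0.0)
    # 2. systemd allocates the dynamic user.
    users["dynuser"] = 61234
    # 3. The service starts quickly and asks who it is one second later.
    after = cache.lookup("dynuser", now=1.0)
    return before, after


if __name__ == "__main__":
    print("negative TTL 20:", simulate(20))
    print("negative TTL 0: ", simulate(0))
```

With a non-zero negative TTL the second lookup still reports the user as missing, while with a TTL of 0 it sees the newly allocated user.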
It was the last database that wasn't listed.
702a21e
to
1d5f4cb
Compare
LGTM.
@arianvp Had a discussion at FOSDEM about how we (ab)use nscd to pass nss modules, as described in 1d5f4cb. This seems to be a bad idea, because glibc falls back to local resolving (with broken nss modules) after very short timeouts. Can we get glibc to load nss modules from some global
@flokli This can break if a program was linked against a different version of glibc.
Only in the case where glibc breaks ABI, right? We do something similar for OpenGL/Vulkan, IIRC.
I opened #55276 about that.
@arianvp This could make it impossible to mix releases, and not just for OpenGL applications but for every application.
This effectively disables nscd's built-in hosts cache, which turns out to be erratic in some cases. We only use nscd these days as a more ABI-neutral NSS dispatcher mechanism. Local caching should still be possible with local resolvers in /etc/resolv.conf (via the `dns` NSS module), or without local resolvers via systemd-resolved (via the `resolve` NSS module). We don't set enable-cache to no due to NixOS#50316 (comment).
Motivation for this change
Nscd caching was interfering with systemd's nss module. Disable caching, as caching is not the reason why we use nscd: we use it to resolve nss modules correctly. See the commit message for more detailed information.
Fixes #50273