[WIP] Don't bump RLIMIT_NOFILE in exec sessions with '--ulimit host' #24243

Draft · wants to merge 1 commit into main from wip/rishi/issues-17681
Conversation

@debarshiray (Member) commented Oct 11, 2024

Starting from commit 9126b45 ("Up default Podman rlimits to avoid max open files"), Podman started bumping its soft limit for the maximum number of open file descriptors (RLIMIT_NOFILE or ulimit -n) to permit exposing a large number of ports to a container. This was later fine-tuned in commit a2c1a2d ("podman: bump RLIMIT_NOFILE also without CAP_SYS_RESOURCE").
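For illustration, the bump roughly amounts to raising the soft RLIMIT_NOFILE to the hard limit, which is as far as a process without CAP_SYS_RESOURCE can go. A minimal sketch of that idea (not Podman's actual setRLimits code):

```go
package main

import "syscall"

// bumpNOFILE is a hypothetical helper, not the real Podman code: it raises
// the soft RLIMIT_NOFILE up to the hard limit, which is all an unprivileged
// process without CAP_SYS_RESOURCE is allowed to do.
func bumpNOFILE() error {
	var rl syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
		return err
	}
	if rl.Cur < rl.Max {
		rl.Cur = rl.Max // soft limit -> hard limit
		return syscall.Setrlimit(syscall.RLIMIT_NOFILE, &rl)
	}
	return nil
}

func main() {
	if err := bumpNOFILE(); err != nil {
		panic(err)
	}
}
```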

Unfortunately, this bump also increases the limits for 'podman exec' sessions running in containers created with:

  $ podman create --network host --ulimit host ...

This is what Toolbx uses to provide a containerized interactive command line environment for software development and troubleshooting the host operating system.

This confuses developers and system administrators who are debugging a process that's leaking file descriptors and crashing on the host OS. The crashes either don't reproduce inside the container or take a lot longer to reproduce, both of which are frustrating.

Therefore, it would be good to retain the host's limits, at least for this specific scenario.

It turns out that since this code was written, the Go runtime has had two interesting changes.

Starting from Go 1.19 [1], the Go runtime bumps the soft limit for RLIMIT_NOFILE for all Go programs [2]. This means that there's no longer any need for Podman to bump its own limits, because Podman switched from requiring Go 1.18 to 1.20 in commit 4dd58f2 ("Move golang requirement from 1.18 to 1.20"). It's probably good to still log the detected limits, in case Go's behaviour changes.

Not everybody was happy with this [3], because the higher limits got propagated to child processes spawned by Go programs. Among other things, this can break old programs using select(2) [4]. So, Go's behaviour was fine-tuned to restore the original soft limit for RLIMIT_NOFILE when forking a child process [5].

With these two changes in Go, which Podman already uses, if the bumping of RLIMIT_NOFILE is left to the Go runtime, then the limits are no longer increased for podman exec sessions. Otherwise, if Podman continues to bump the soft limit for RLIMIT_NOFILE on its own, then it prevents the Go runtime from restoring the original limits when forking, and leads to the higher limits in podman exec sessions.
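As a self-contained illustration of this interaction (a minimal sketch, not Podman code; it assumes a Linux host and Go 1.19 or later):

```go
package main

import (
	"fmt"
	"os/exec"
	"syscall"
)

func main() {
	var rl syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
		panic(err)
	}
	// With Go 1.19+ the runtime has already raised the soft limit, so this
	// typically prints soft == hard.
	fmt.Printf("in-process: soft=%d hard=%d\n", rl.Cur, rl.Max)

	// A child process gets the original soft limit back (commonly 1024),
	// because the runtime restores it when forking -- unless the program
	// itself called Setrlimit, as Podman currently does.
	out, err := exec.Command("sh", "-c", "ulimit -S -n").Output()
	if err != nil {
		panic(err)
	}
	fmt.Printf("in a child: soft=%s", out)
}
```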

The existing podman run --ulimit host ... ulimit -Hn test in test/e2e/run_test.go was extended to also check the soft limit. The similar test for podman exec was moved from test/e2e/toolbox_test.go to test/e2e/exec_test.go for consistency and because there's nothing Toolbx specific about it. The test was similarly extended, and updated to be more idiomatic.

Due to the behaviour of the Go runtime noted above, and since the tests are written in Go, the current or soft limit for RLIMIT_NOFILE returned by syscall.Getrlimit() is the same as the hard limit.

The Alpine Linux image doesn't have a standalone binary for ulimit, and its shell builtin is picky about the order in which the options are listed: the -H or -S must come first, followed by a space, and then the -n.
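For example, such a check has to spell the options out as '-S -n' (a rough sketch using plain os/exec rather than the actual e2e test helpers; the container name 'ctr' is a placeholder for a container created with '--ulimit host'):

```go
package main

import (
	"fmt"
	"os/exec"
)

func main() {
	// Hypothetical check, not the actual e2e test: read the soft
	// RLIMIT_NOFILE inside an Alpine-based container, where ulimit is a
	// shell builtin and the flags must be written as "-S -n", not "-Sn".
	out, err := exec.Command("podman", "exec", "ctr",
		"sh", "-c", "ulimit -S -n").CombinedOutput()
	if err != nil {
		fmt.Println("podman exec failed:", err, string(out))
		return
	}
	fmt.Printf("soft RLIMIT_NOFILE inside the container: %s", out)
}
```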

[1] https://go.dev/doc/go1.19#runtime

[2] Go commit 8427429c592588af ("os: raise open file rlimit at startup")
golang/go@8427429c592588af
golang/go#46279

[3] containerd/containerd#8249

[4] http://0pointer.net/blog/file-descriptor-limits.html

[5] Go commit f5eef58e4381259c ("syscall: restore original NOFILE ...")
golang/go@f5eef58e4381259c
golang/go#46279

Fixes: #17681

Does this PR introduce a user-facing change?

Retained the soft limit for the maximum number of open file descriptors (RLIMIT_NOFILE or 'ulimit -n') for containers created with '--ulimit host'.

@openshift-ci openshift-ci bot added the do-not-merge/release-note-label-needed Enforce release-note requirement, even if just None label Oct 11, 2024
openshift-ci bot (Contributor) commented Oct 11, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: debarshiray
Once this PR has been reviewed and has the lgtm label, please assign edsantiago for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@debarshiray force-pushed the wip/rishi/issues-17681 branch 2 times, most recently from d53ce41 to aba3b60, on October 11, 2024 at 22:03
@debarshiray force-pushed the wip/rishi/issues-17681 branch from aba3b60 to 302d638 on October 14, 2024 at 00:08
@debarshiray (Member, Author) commented:
I don't quite understand what is going on here.

In my local builds on a Fedora 40 host, this doesn't change the outcome of podman run ... ulimit -Sn, but apparently it does for the CI. I wonder if there's some other piece in the puzzle that's behaving differently.

@debarshiray (Member, Author) commented:
> Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

I don't know what to do here, because RELEASE_NOTES.md in the main branch has the release notes for 5.1.0 that was actually released from the v5.1 branch.

@Luap99 (Member) commented Oct 14, 2024

> Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

> I don't know what to do here, because RELEASE_NOTES.md in the main branch has the release notes for 5.1.0 that was actually released from the v5.1 branch.

You have to add the release note to the PR description; see the template text under "Does this PR introduce a user-facing change?"

@Luap99 (Member) left a review comment

This will get a conflict after #24228 merges, but fundamentally I do not see this as a correct change. We really want to keep the high ulimit for (most?) of our child processes.

For example, the darwin code was added in #20643 because we needed a higher limit for qemu; while we no longer use qemu there, I would think the same applies to vfkit, and we still use qemu on Linux, so this would break it there as well.
And as you pointed out, it was originally added for the port forwarding logic, which opens a lot of fds; this problem is not solved by the Go runtime code. As root, all the fds are passed into conmon, so it also needs a high limit. And as rootless we use pasta or slirp4netns, which open the ports, so they all need higher limits.
And there might be other child processes as well that I don't know of right now.

As such, I do not think your current change is correct:

  $ sudo bin/podman run -p 1000-2000 quay.io/libpod/testimage:20240123 ip a
  Error: OCI runtime error: crun: openat2 `proc/bus`: Too many open files

I think that to actually solve your issue it would make more sense to store the host limits in a global variable, and then for --ulimit host set them in the container spec accordingly.

@openshift-ci openshift-ci bot added release-note and removed do-not-merge/release-note-label-needed Enforce release-note requirement, even if just None labels Oct 14, 2024
@debarshiray (Member, Author) commented:
Thanks for the review!

> This will get a conflict after #24228 merges

Yes, I saw it fly by and it reminded me of this problem.

> but fundamentally I do not see this as a correct change. We really want to keep the high ulimit for (most?) of our child processes.

Right. I didn't know if the higher limit is necessary for the child processes or just the podman(1) process.

I didn't test the port forwarding with my local builds because pasta(1) kept erroring out. So I had to test with --network host. I need to check if I missed some details for my local build.

> I think that to actually solve your issue it would make more sense to store the host limits in a global variable, and then for --ulimit host set them in the container spec accordingly.

As far as I can make out, a Go program can never get the original soft limit for RLIMIT_NOFILE with syscall.Getrlimit() because the Go runtime has already increased it at start-up.

Instead, it seems that what a Go program can do is prevent the original limit from being restored when forking a child process. One way of doing this is to set the limits explicitly in Podman, like we do today.

I am wary of embedding the limits in the immutable persistent definition of the container, because if for some reason the limits change in the host OS, then the container will fail to start. The limits can change without any intervention from the system administrator. Just updating the OS and rebooting is enough. I remember hitting this a few times in the past.

So, it might be better to do all this at run-time.

One thing to note is that the limits for the container's entry point don't matter for Toolbx users. They always get a podman exec ... session.

We could make this conditional on exec sessions in containers with --ulimit host. However, that information is not going to be available so early to earlyInitHook(). We will have to move the code around for that.

@Luap99 (Member) commented Oct 14, 2024

> We could make this conditional on exec sessions in containers with --ulimit host. However, that information is not going to be available so early to earlyInitHook(). We will have to move the code around for that.

That really sucks; we already have some things like that and I hate it with a passion. It will also not work for things like the podman system service.

I have not looked deeply into what the Go runtime does with the limits and whether there is a way to get the real original limits. If the Go runtime does not expose them, there is already a workaround via the pkg/rootless C code that runs before the Go runtime kicks in, so it would in theory be possible to get and store the limits there and then add our own Go function to access the real values.
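Roughly, such a workaround might look like the following (a speculative sketch, not the existing pkg/rootless code; the package and function names are made up). A C constructor records RLIMIT_NOFILE before the Go runtime raises the soft limit, and a Go accessor exposes the saved values:

```go
// Speculative sketch, not the existing pkg/rootless code: capture the
// original RLIMIT_NOFILE in a C constructor, which runs before the Go
// runtime gets a chance to raise the soft limit, and expose it to Go.
package hostlimits

/*
#include <sys/resource.h>

static struct rlimit saved_nofile;
static int saved_ok = 0;

__attribute__((constructor)) static void save_nofile(void) {
	if (getrlimit(RLIMIT_NOFILE, &saved_nofile) == 0)
		saved_ok = 1;
}

static int get_saved_nofile(struct rlimit *out) {
	if (!saved_ok)
		return -1;
	*out = saved_nofile;
	return 0;
}
*/
import "C"

import "errors"

// OriginalNOFILE returns the soft and hard RLIMIT_NOFILE values recorded by
// the C constructor, i.e. before the Go runtime bumped the soft limit.
func OriginalNOFILE() (soft, hard uint64, err error) {
	var rl C.struct_rlimit
	if C.get_saved_nofile(&rl) != 0 {
		return 0, 0, errors.New("original RLIMIT_NOFILE was not captured")
	}
	return uint64(rl.rlim_cur), uint64(rl.rlim_max), nil
}
```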

@polarathene commented Oct 14, 2024

> We really want to keep the high ulimit for (most?) of our child processes.

Hey, sorry, I'm not too familiar with all the context and history within the Podman project regarding this (I am a bit short on time to investigate), but I was involved in pushing the fix for both Docker (v25) and Containerd (v2.0) to revert their usage of infinity, specifically in regards to a soft limit above 1024.

Until Containerd has the v2.0 release and Docker does a follow-up release to upgrade to it, the issue still hasn't quite been resolved either. For these two projects, however, it was their systemd .service files that were at fault.


Within the container environment itself, ulimit -Sn should really be 1024 unless explicitly increased by the user, just as the host environment outside of a container behaves. Not doing so is known to have caused problems with various software, be it in CPU usage or in memory allocation and crashes.

I've only recently used rootless Podman to verify a docs contribution to a project, where I had to troubleshoot something and was curious whether FD limits might be at fault; the container turned out to have 524288:524288. 524288 is expected for the hard limit (it's what systemd will set by default), but the soft limit should remain 1024, as per the 0pointer.net blog post.


Any software running within the container itself that needs a higher soft limit should explicitly request that at runtime, which Go software will do on behalf of the user these days (as already mentioned above).

I am aware of Amazon eagerly adopting the change for AWS and reverting it shortly after due to some breakage it introduced for them, similar to Envoy, which has also implicitly relied on the container environment to provide excessive FD limits (a requirement that isn't documented for their software). These are cases where the software has bugs that should be fixed internally, rather than relying on container runtimes implementing "fixes" that negatively impact other software.

FWIW, 524288 isn't significantly bad for the resource regression impact (CPU or memory allocation), so you should only have an issue there with software using select() (which many popular projects still seem to have in their codebases).


If you'd like any additional reference links for what I've mentioned above, let me know and I'll dig those up for you 👍

@giuseppe (Member) commented:

I think this is fixed with: #24228

@Luap99 (Member) commented Oct 15, 2024

> I think this is fixed with: #24228

No, that still bumps the soft limit to the hard limit, so --ulimit host will not have the proper soft limit set in the container.

@giuseppe (Member) commented:

> I think this is fixed with: #24228

> No, that still bumps the soft limit to the hard limit, so --ulimit host will not have the proper soft limit set in the container.

Yes, true. Since we set the values we expect in the OCI configuration, we could probably just drop setRLimits. @debarshiray would you mind rebasing the PR and dropping setRLimits altogether?

@Luap99 (Member) commented Oct 16, 2024

As written above (#24243 (review)), we cannot drop it, because then our child processes would not have the high limit, causing other problems.

@debarshiray (Member, Author) commented:

I was waiting for #24228 to get merged. I will get back to this pull request now.

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 16, 2024
@openshift-merge-robot (Collaborator) commented:

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@debarshiray debarshiray marked this pull request as draft October 30, 2024 00:22
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 30, 2024
@debarshiray debarshiray changed the title Don't bump RLIMIT_NOFILE in exec sessions with '--ulimit host' [WIP] Don't bump RLIMIT_NOFILE in exec sessions with '--ulimit host' Oct 30, 2024
Labels: do-not-merge/work-in-progress, needs-rebase, release-note

Successfully merging this pull request may close these issues: '--ulimit host' permits more open files in the container