Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NO_FILE ulimit decreases from 1048576 to 1024 when upgrading to v1.4.2 #7198

Closed
jonkerj opened this issue May 9, 2023 · 1 comment · Fixed by #7199 or #7208
Closed

NO_FILE ulimit decreases from 1048576 to 1024 when upgrading to v1.4.2 #7198

jonkerj opened this issue May 9, 2023 · 1 comment · Fixed by #7199 or #7208
Assignees

Comments

@jonkerj
Copy link
Contributor

jonkerj commented May 9, 2023

Bug Report

Description

When I upgrade from v1.4.1 to v1.4.2, the default ulimit for NO_FILE decreases from 1048576 to 1024:

  1. Deploy single node Talos on QEMU using official ISO
  2. Exec into any pod: ulimit -n, which should show 1048576
  3. Upgrade to v1.4.2
  4. Repeat 2 and observe 1024

My trigger case was InfluxDB (which really loves FDs), complaining about "too many open files", just after I upgraded to v1.4.2 but any pod shows this behavior. After having downgraded to v1.4.1, everything is OK again.

Logs

n/a

Environment

  • Talos version: v1.4.2
  • Kubernetes version: 1.26.3
  • Platform: QEMU
@frezbo frezbo self-assigned this May 9, 2023
@frezbo
Copy link
Member

frezbo commented May 9, 2023

The issue is that now Go honors default RLIMIT for forked processes (which is actually correct). golang/go#46279 (comment)

The fix would be to set a sensible default from Talos when running processes from machined

frezbo added a commit to frezbo/talos that referenced this issue May 9, 2023
Now Go only sets the rlimit for the parent and any fork/exec'ed process
gets the rlimit that was the default before fork/exec. Ref: golang/go#46279

This fix got backported to [Go 1.20.4](golang/go@ecf7e00) breaking Talos.
Talos used to set rlimit in the [`SetRLimit`](https://github.com/siderolabs/talos/blob/v1.4.2/internal/app/machined/pkg/runtime/v1alpha1/v1alpha1_sequencer_tasks.go#L302) sequencer task.
This means any process started by `wrapperd` gets the default Rlimit
(1024). Fix this by explicitly setting `rlimit` in `wrapperd` before we
drop any capabilities.

Fixes: siderolabs#7198

Signed-off-by: Noel Georgi <git@frezbo.dev>
(cherry picked from commit a2565f6)
@frezbo frezbo moved this from Proposed to Accepted in Backports to 1.4 May 10, 2023
frezbo added a commit to frezbo/talos that referenced this issue May 11, 2023
This reverts commit a2565f6.

The fix done in `a2565f67`, was actually a no-op caused by the
misunderstanding the fix done in Go and backported to [Go 1.20.4](golang/go@ecf7e00).
The fix gave a false confidence that it was working when it was tested
against Talos `main` branch since the PR siderolabs#7190 bumped `x/sys` package
from [v0.7.0 -> v0.8.0](golang/go@ecf7e00), the actual change in `x/sys` can be found here at https://cs.opensource.google/go/x/sys/+/ff18efa0a3fa22d4fede381b822bbcfe53b7ee7c which meant that when updating Go to 1.20.4 the `x/sys` package should been updated too. The `x/sys` package changed how the syscall to set the rlimit was called, it got moved into the Go stdlib instead of calling rlimit syscall in the `x/sys` package, which meant a combination of using Go 1.20.4 and an older `x/sys` package means `RLIMIT_NOFILE` value would not be set back to the original value.

The Talos 1.4 release branch currently have  `x/sys`
at [v0.7.0(https://github.com/siderolabs/talos/blob/v1.4.3/go.mod#L133),
so the backport would consist of this change along another commit bumping `x/sys` package to `v0.8.0`.

Fixes: siderolabs#7198
Fixes: siderolabs#7206

Co-authored-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
Signed-off-by: Noel Georgi <git@frezbo.dev>
frezbo added a commit to frezbo/talos that referenced this issue May 11, 2023
This reverts commit a2565f6.

The fix done in `a2565f67`, was actually a no-op caused by the
misunderstanding the fix done in Go and backported to [Go 1.20.4](golang/go@ecf7e00).
The fix gave a false confidence that it was working when it was tested
against Talos `main` branch since the PR siderolabs#7190 bumped `x/sys` package
from [v0.7.0 -> v0.8.0](golang/go@ecf7e00), the actual change in `x/sys` can be found here at https://cs.opensource.google/go/x/sys/+/ff18efa0a3fa22d4fede381b822bbcfe53b7ee7c which meant that when updating Go to 1.20.4 the `x/sys` package should been updated too. The `x/sys` package changed how the syscall to set the rlimit was called, it got moved into the Go stdlib instead of calling rlimit syscall in the `x/sys` package, which meant a combination of using Go 1.20.4 and an older `x/sys` package means `RLIMIT_NOFILE` value would not be set back to the original value.

The Talos 1.4 release branch currently have  `x/sys`
at [v0.7.0(https://github.com/siderolabs/talos/blob/v1.4.3/go.mod#L133),
so the backport would consist of this change along another commit bumping `x/sys` package to `v0.8.0`.

Fixes: siderolabs#7198
Fixes: siderolabs#7206

Co-authored-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
Signed-off-by: Noel Georgi <git@frezbo.dev>
(cherry picked from commit 4f720d4)
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jun 13, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
2 participants