pageserver: default to 4MiB stack size and add env var to control it #8862

Merged
problame merged 1 commit into main from problame/4-mib-stacks
Aug 29, 2024

problame
Contributor

@problame problame commented Aug 29, 2024

Motivation

In #8832 I get tokio runtime worker stack overflow errors in debug builds.

In a similar vein, I hit tokio runtime worker stack overflows when trying to eliminate async_trait (#8296).

The 2MiB default is kind of arbitrary - so this PR bumps it to 4MiB.

It also adds an env var to control it.
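A minimal sketch of how such an override could be read, assuming a hypothetical variable name `EXAMPLE_THREAD_STACK_SIZE` holding a byte count (the actual name and parsing rules used by this PR are not shown here). Falling back to the default on an invalid value is a choice made for this illustration. The sketch uses `std::thread::Builder` so it stays dependency-free; in a tokio setup the resulting value would be passed to `tokio::runtime::Builder::thread_stack_size` instead.

```rust
use std::env;
use std::thread;

/// Default from this PR: 4 MiB per thread.
const DEFAULT_STACK_SIZE: usize = 4 * 1024 * 1024;

/// Parse an optional override given in bytes. Absent, unparsable, or zero
/// values fall back to the default (an assumption of this sketch).
fn stack_size_from(raw: Option<&str>) -> usize {
    raw.and_then(|s| s.trim().parse::<usize>().ok())
        .filter(|&n| n > 0)
        .unwrap_or(DEFAULT_STACK_SIZE)
}

/// Read the hypothetical env var; the name is illustrative only.
fn configured_stack_size() -> usize {
    stack_size_from(env::var("EXAMPLE_THREAD_STACK_SIZE").ok().as_deref())
}

/// Run `f` on a thread with the given stack size and return its result.
fn run_with_stack<T, F>(size: usize, f: F) -> T
where
    F: FnOnce() -> T + Send + 'static,
    T: Send + 'static,
{
    thread::Builder::new()
        .stack_size(size)
        .spawn(f)
        .expect("failed to spawn thread")
        .join()
        .expect("thread panicked")
}
```

For example, `run_with_stack(configured_stack_size(), || deep_recursion())` would run a hypothetical `deep_recursion` function with whatever stack size the environment selects.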

Risk Assessment

With our 4 runtimes, the worst case stack memory usage is 4 (runtimes) * ($num_cpus (executor threads) + 512 (blocking pool threads)) * 4MiB.

On i3en.3xlarge, that's 8384 MiB.
On im4gn.2xlarge, that's 8320 MiB.
Before this change, it was half that.
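The figures above can be reproduced with a small calculation. This sketch assumes 12 vCPUs for i3en.3xlarge and 8 vCPUs for im4gn.2xlarge, which is what the quoted totals imply; the function name is made up for illustration.

```rust
/// Worst-case stack memory in MiB:
/// runtimes * (executor threads + blocking pool threads) * stack size.
fn worst_case_stack_mib(
    runtimes: u64,
    num_cpus: u64,
    blocking_pool_threads: u64,
    stack_mib: u64,
) -> u64 {
    runtimes * (num_cpus + blocking_pool_threads) * stack_mib
}
```

With 4 runtimes, 512 blocking pool threads, and 4 MiB stacks, `worst_case_stack_mib(4, 12, 512, 4)` evaluates to 8384 and `worst_case_stack_mib(4, 8, 512, 4)` to 8320. With the previous 2 MiB stacks the same inputs give 4192 and 4160.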

Looking at production metrics, we do have the headroom to accommodate this worst case.

Alternatives

The problems only occur with debug builds, so technically we could only raise the stack size for debug builds.

However, it would be another configuration where debug != release.

Future Work

If we ever enable single runtime mode in prod (=> #7312), the worst case will drop to 25% of its current value.

Eliminating the use of tokio::task::spawn_blocking / tokio::fs in favor of tokio-epoll-uring (=> #7370) would reduce the worst case to 4 (runtimes) * $num_cpus (executor threads) * 4 MiB.
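To put rough numbers on the two future-work scenarios, here is a sketch using the same formula as the risk assessment. The function names are hypothetical, and the 12-vCPU input used below is the count implied by the i3en.3xlarge figure.

```rust
const STACK_MIB: u64 = 4;
const BLOCKING_POOL_THREADS: u64 = 512;

/// Today: 4 runtimes, each with executor threads plus a blocking pool.
fn current_mib(num_cpus: u64) -> u64 {
    4 * (num_cpus + BLOCKING_POOL_THREADS) * STACK_MIB
}

/// Single runtime mode: one runtime instead of four.
fn single_runtime_mib(num_cpus: u64) -> u64 {
    (num_cpus + BLOCKING_POOL_THREADS) * STACK_MIB
}

/// Four runtimes, but no blocking pool threads at all.
fn no_blocking_pool_mib(num_cpus: u64) -> u64 {
    4 * num_cpus * STACK_MIB
}
```

For 12 vCPUs, `single_runtime_mib(12)` is 2096 MiB, a quarter of `current_mib(12)` at 8384 MiB, while `no_blocking_pool_mib(12)` is 192 MiB, which shows that the blocking pool dominates the worst case.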

@problame problame self-assigned this Aug 29, 2024
@arpad-m
Member

arpad-m commented Aug 29, 2024

cc #8296 which wanted to increase it for safekeepers

@problame problame merged commit c748140 into main Aug 29, 2024
67 checks passed
@problame problame deleted the problame/4-mib-stacks branch August 29, 2024 12:02
3787 tests run: 3681 passed, 0 failed, 106 skipped (full report)


Flaky tests (2)

Postgres 16

Code coverage* (full report)

  • functions: 32.5% (7404 of 22763 functions)
  • lines: 50.7% (60046 of 118536 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
bc89002 at 2024-08-29T12:02:55.707Z :recycle:
