pageserver: default to 4MiB stack size and add env var to control it #8862

problame · 2024-08-29T10:58:49Z

Motivation

In #8832 I get tokio runtime worker stack overflow errors in debug builds.

In a similar vein, I had tokio runtime worker stack overflows when trying to eliminate async_trait (#8296).

The 2MiB default is kind of arbitrary - so this PR bumps it to 4MiB.

It also adds an env var to control it.
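A minimal sketch of how such an override could be read, assuming the value is a plain byte count. The variable name `EXAMPLE_TOKIO_THREAD_STACK_SIZE` and the helper `parse_stack_size` are hypothetical and are not taken from this PR; only the 4MiB default comes from the description above.

```rust
use std::env;

/// Default worker stack size: 4 MiB, the new default described in this PR.
const DEFAULT_STACK_SIZE_BYTES: usize = 4 * 1024 * 1024;

/// Hypothetical variable name, chosen for illustration only.
const STACK_SIZE_ENV_VAR: &str = "EXAMPLE_TOKIO_THREAD_STACK_SIZE";

/// Parse an optional override given in bytes; fall back to the default when unset.
fn parse_stack_size(raw: Option<&str>) -> Result<usize, String> {
    match raw {
        None => Ok(DEFAULT_STACK_SIZE_BYTES),
        Some(s) => s
            .trim()
            .parse::<usize>()
            .map_err(|e| format!("invalid {}={:?}: {}", STACK_SIZE_ENV_VAR, s, e)),
    }
}

fn main() {
    // A non-UTF-8 value is treated like an unset variable in this sketch.
    let raw = env::var(STACK_SIZE_ENV_VAR).ok();
    let stack_size = parse_stack_size(raw.as_deref()).expect("stack size must be a byte count");
    // With tokio as a dependency, this value would be handed to
    // tokio::runtime::Builder::thread_stack_size(stack_size).
    println!("thread stack size: {} bytes", stack_size);
}
```

Failing loudly on a malformed value, rather than silently using the default, is a design choice of this sketch.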

Risk Assessment

With our 4 runtimes, the worst case stack memory usage is 4 (runtimes) * ($num_cpus (executor threads) + 512 (blocking pool threads)) * 4MiB.

On i3en.3xlarge, that's 8384 MiB.
On im4gn.2xlarge, that's 8320 MiB.
Before this change, it was half that.
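The arithmetic above can be checked with a small sketch. The vCPU counts (12 for i3en.3xlarge, 8 for im4gn.2xlarge) are assumptions consistent with the totals quoted here, and the function name is illustrative only.

```rust
/// Worst-case stack memory in MiB:
/// runtimes * (executor threads + blocking pool threads) * stack size.
fn worst_case_stack_mib(runtimes: u64, num_cpus: u64, blocking_threads: u64, stack_mib: u64) -> u64 {
    runtimes * (num_cpus + blocking_threads) * stack_mib
}

fn main() {
    // (instance type, assumed vCPU count)
    let instances = [("i3en.3xlarge", 12u64), ("im4gn.2xlarge", 8u64)];
    for (name, vcpus) in instances {
        let before = worst_case_stack_mib(4, vcpus, 512, 2); // 2MiB stacks
        let after = worst_case_stack_mib(4, vcpus, 512, 4); // 4MiB stacks
        println!("{}: {} MiB before, {} MiB after", name, before, after);
    }
}
```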

Looking at production metrics, we do have the headroom to accommodate this worst case.

Alternatives

The problems only occur with debug builds, so technically we could raise the stack size only for debug builds.

However, it would be another configuration where debug != release.

Future Work

If we ever enable single runtime mode in prod (=> #7312) then the worst case will drop to 25% of its current value.

Eliminating the use of tokio::task::spawn_blocking / tokio::fs in favor of tokio-epoll-uring (=> #7370) would reduce the worst case to 4 (runtimes) * $num_cpus (executor threads) * 4 MiB.
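To put numbers on these two scenarios, here is a rough sketch for a 12-vCPU host (an assumed example size). The struct and its field names are hypothetical; the formulas are the ones stated above.

```rust
/// Hypothetical description of a runtime layout, for illustration only.
struct Layout {
    runtimes: u64,
    executor_threads: u64,
    blocking_threads: u64,
    stack_mib: u64,
}

impl Layout {
    /// Worst case: every thread of every runtime uses its full stack.
    fn worst_case_mib(&self) -> u64 {
        self.runtimes * (self.executor_threads + self.blocking_threads) * self.stack_mib
    }
}

fn main() {
    let current = Layout { runtimes: 4, executor_threads: 12, blocking_threads: 512, stack_mib: 4 };
    // Single runtime mode: one runtime instead of four.
    let single_runtime = Layout { runtimes: 1, ..current };
    // No blocking pool usage: only executor threads remain.
    let no_blocking_pool = Layout { blocking_threads: 0, ..current };

    println!("current:          {} MiB", current.worst_case_mib());
    println!("single runtime:   {} MiB", single_runtime.worst_case_mib());
    println!("no blocking pool: {} MiB", no_blocking_pool.worst_case_mib());
}
```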

arpad-m · 2024-08-29T11:06:03Z

cc #8296 which wanted to increase it for safekeepers

github-actions · 2024-08-29T12:02:56Z

3787 tests run: 3681 passed, 0 failed, 106 skipped (full report)

Flaky tests (2)

Postgres 16

test_replica_start_scan_clog_crashed_xids: release-arm64
test_delete_timeline_client_hangup: debug-x86-64

Code coverage* (full report)

functions: 32.5% (7404 of 22763 functions)
lines: 50.7% (60046 of 118536 lines)

* collected from Rust tests only

The comment gets automatically updated with the latest test results.
bc89002 at 2024-08-29T12:02:55.707Z

default to 4MiB stack size and add env var to control it

bc89002

problame requested a review from a team as a code owner August 29, 2024 10:58

problame requested review from jcsp, arpad-m and koivunej August 29, 2024 10:58

problame mentioned this pull request Aug 29, 2024

tenant background loops: periodic log message if long-running iteration #8832

Merged

koivunej approved these changes Aug 29, 2024

problame self-assigned this Aug 29, 2024

jcsp approved these changes Aug 29, 2024

arpad-m approved these changes Aug 29, 2024

problame merged commit c748140 into main Aug 29, 2024
67 checks passed

problame deleted the problame/4-mib-stacks branch August 29, 2024 12:02
