Allow longer timeout for starting pageserver, safe keeper and storage controller in test cases to make test cases less flaky #8079

Bodobolero · 2024-06-17T15:34:57Z

Problem

See #8070.

Summary of changes

The neon_local subcommands

start neon
start pageserver
start safekeeper
start storage controller

gain a new option, -t=xx or --start-timeout=xx, that sets how long (in seconds) we wait for the process to start, so callers can request a longer timeout than the default.
This is useful in test cases where the pageserver has to read a lot of layer data, like in pagebench test cases.
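To illustrate what a configurable start timeout means in practice, here is a minimal Python sketch of a polling loop that waits for a process to become ready. It is not the neon_local implementation (that logic lives in the Rust code and is not shown in this description); the function name `wait_until_started`, the `is_ready` callback, and the retry interval are assumptions made for illustration. The 10-second default mirrors the previous timeout mentioned in this PR.

```python
import time
from typing import Callable


def wait_until_started(
    is_ready: Callable[[], bool],
    start_timeout_s: float = 10.0,
    retry_interval_s: float = 0.1,
) -> bool:
    """Poll is_ready() until it returns True or start_timeout_s elapses.

    Hypothetical helper: returns True if the process became ready in time,
    False if the deadline passed first.
    """
    if retry_interval_s <= 0:
        raise ValueError("retry interval must be positive")
    # Use a monotonic clock so wall-clock adjustments do not affect the deadline.
    deadline = time.monotonic() + start_timeout_s
    while True:
        if is_ready():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(retry_interval_s)
```

A caller whose process needs more time to come up (for example a pageserver in a pagebench test) would pass `start_timeout_s=60.0` instead of relying on the default.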

In addition, we use the new timeout option in the Python test infrastructure (Python fixtures) and modify the flaky test case to increase the timeout from 10 seconds to 1 minute.
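As a rough sketch of how a fixture could forward an optional timeout to neon_local, the helper below builds the argument list for a `start` subcommand. The function name `neon_local_start_args` and its parameters are hypothetical and are not taken from `neon_fixtures.py`; only the `--start-timeout=<N>s` flag and the subcommand shapes come from this PR.

```python
from typing import List, Optional


def neon_local_start_args(
    component: str,
    extra_args: Optional[List[str]] = None,
    timeout_in_seconds: Optional[int] = None,
) -> List[str]:
    """Build the neon_local arguments for starting one component.

    Hypothetical helper: the flag is appended only when a timeout is given,
    so callers that pass nothing keep neon_local's default behaviour.
    """
    args = [component, "start"]
    args.extend(extra_args or [])
    if timeout_in_seconds is not None:
        args.append(f"--start-timeout={timeout_in_seconds}s")
    return args
```

With this shape, `neon_local_start_args("pageserver", ["--id=1"], timeout_in_seconds=60)` yields `pageserver start --id=1 --start-timeout=60s`, and omitting the timeout leaves the command line unchanged.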

Example from the test execution

RUST_BACKTRACE=1 NEON_ENV_BUILDER_USE_OVERLAYFS_FOR_SNAPSHOTS=1 DEFAULT_PG_VERSION=15 BUILD_TYPE=release ./scripts/pytest test_runner/performance/pageserver/pagebench/test_pageserver_max_throughput_getpage_at_latest_lsn.py
...
2024-06-19 09:29:34.590 INFO [neon_fixtures.py:1513] Running command "/instance_store/neon/target/release/neon_local storage_controller start --start-timeout=60s"
2024-06-19 09:29:36.365 INFO [broker.py:34] starting storage_broker to listen incoming connections at "127.0.0.1:15001"
2024-06-19 09:29:36.365 INFO [neon_fixtures.py:1513] Running command "/instance_store/neon/target/release/neon_local pageserver start --id=1 --start-timeout=60s"
2024-06-19 09:29:36.366 INFO [neon_fixtures.py:1513] Running command "/instance_store/neon/target/release/neon_local safekeeper start 1 --start-timeout=60s"
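The `60s` values in the log are duration strings with a unit suffix. As an illustration only, the sketch below converts such a string into whole seconds. It handles just a small assumed subset (`s`, `m`, `h` with an integer amount) and is not the parser used by neon_local, which per the review commits relies on Rust's duration types and the humantime crate. The name `parse_start_timeout` is hypothetical.

```python
import re

# Seconds per supported unit; this subset is an assumption of the sketch.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def parse_start_timeout(value: str) -> int:
    """Convert a duration string such as '60s' or '1m' into seconds.

    Hypothetical helper: rejects anything outside '<integer><s|m|h>',
    including bare numbers without a unit.
    """
    match = re.fullmatch(r"(\d+)([smh])", value.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {value!r}")
    amount, unit = match.groups()
    return int(amount) * _UNIT_SECONDS[unit]
```

Under these assumptions, `parse_start_timeout("60s")` and `parse_start_timeout("1m")` describe the same one-minute timeout.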

github-actions · 2024-06-17T16:18:33Z

3341 tests run: 3215 passed, 0 failed, 126 skipped (full report)

Flaky tests (3)

Postgres 16

test_change_pageserver: debug
test_secondary_background_downloads: release

Postgres 14

test_subscriber_restart: debug

Code coverage* (full report)

functions: 32.3% (6842 of 21153 functions)
lines: 49.8% (53337 of 107157 lines)

* collected from Rust tests only

The comment gets automatically updated with the latest test results.
cfd8645 at 2024-06-21T10:44:23.313Z

hlinnaka · 2024-06-18T10:15:52Z

No objections, but I wonder if this is just papering over the real issue:

> This is useful in test cases where the pageserver has to read a lot of layer data, like in pagebench test cases.

Why does having a lot of layer data make the pageserver start up slow? There shouldn't be anything in the startup codepath that scales with the amount of data or # of layers, right?

Bodobolero · 2024-06-18T11:45:22Z

> No objections, but I wonder if this is just papering over the real issue:
>
> > This is useful in test cases where the pageserver has to read a lot of layer data, like in pagebench test cases.
>
> Why does having a lot of layer data make the pageserver start up slow? There shouldn't be anything in the startup codepath that scales with the amount of data or # of layers, right?

We discussed this on Slack and agreed to increase the timeout.

problame

+1 for rust bits

sharnoff · 2024-06-19T16:55:48Z

(maybe off-topic, but why is autoscaling review requested?)

Bodobolero · 2024-06-21T09:15:19Z

> (maybe off-topic, but why is autoscaling review requested?)

Good question. I did not explicitly request it; maybe it comes from code ownership rules for the affected files?

… controller in test cases to make test cases less flaky (#8079)

## Problem

see #8070

## Summary of changes

the neon_local subcommands to

- start neon
- start pageserver
- start safekeeper
- start storage controller

get a new option -t=xx or --start-timeout=xx which allows to specify a longer timeout in seconds we wait for the process start.
This is useful in test cases where the pageserver has to read a lot of layer data, like in pagebench test cases.

In addition we exploit the new timeout option in the python test infrastructure (python fixtures) and modify the flaky testcase to increase the timeout from 10 seconds to 1 minute.

Example from the test execution

```bash
RUST_BACKTRACE=1 NEON_ENV_BUILDER_USE_OVERLAYFS_FOR_SNAPSHOTS=1 DEFAULT_PG_VERSION=15 BUILD_TYPE=release ./scripts/pytest test_runner/performance/pageserver/pagebench/test_pageserver_max_throughput_getpage_at_latest_lsn.py
...
2024-06-19 09:29:34.590 INFO [neon_fixtures.py:1513] Running command "/instance_store/neon/target/release/neon_local storage_controller start --start-timeout=60s"
2024-06-19 09:29:36.365 INFO [broker.py:34] starting storage_broker to listen incoming connections at "127.0.0.1:15001"
2024-06-19 09:29:36.365 INFO [neon_fixtures.py:1513] Running command "/instance_store/neon/target/release/neon_local pageserver start --id=1 --start-timeout=60s"
2024-06-19 09:29:36.366 INFO [neon_fixtures.py:1513] Running command "/instance_store/neon/target/release/neon_local safekeeper start 1 --start-timeout=60s"
```

Bodobolero added 3 commits June 17, 2024 14:41

add retry timeout option to neon_local command

fdb555c

increase timeout for pageserver start in flaky pagebench test

af4fc6d

fix bug found in testing

aa31adf

Bodobolero requested review from problame and bayandin June 17, 2024 15:35

bayandin reviewed Jun 17, 2024

cargo fmt

325ce67

Bodobolero added the run-benchmarks label ("Indicates to the CI that benchmarks should be run for PR marked with this label") Jun 17, 2024

Bodobolero added 2 commits June 17, 2024 18:27

remove logfile staged by mistake

9cd4848

rename externally visible option to '-t' and '--start-timeout'

a8cddb6

Bodobolero requested a review from bayandin June 17, 2024 17:01

problame requested changes Jun 18, 2024

review comments - use time::Duration and humatime::Duration for parsing

9304f13

Bodobolero requested a review from problame June 19, 2024 09:32

problame approved these changes Jun 19, 2024

review comments

9f16000

bayandin approved these changes Jun 19, 2024

Bodobolero enabled auto-merge (squash) June 19, 2024 11:00

Bodobolero requested review from a team as code owners June 19, 2024 12:15

Bodobolero requested review from petuhovskiy, tristan957, Omrigan and mtyazici June 19, 2024 12:15

Bodobolero force-pushed the bodobolero/variable_timeout_neonlocal branch from 33c8107 to 9f16000 Compare June 19, 2024 12:22

merged

21c1442

invalid retry interval

cfd8645

Bodobolero removed request for a team, Omrigan, petuhovskiy, tristan957 and mtyazici June 21, 2024 09:16

Bodobolero merged commit 82266a2 into main Jun 21, 2024
66 of 68 checks passed

Bodobolero deleted the bodobolero/variable_timeout_neonlocal branch June 21, 2024 10:36

Bodobolero mentioned this pull request Jun 25, 2024

run pagebench performance tests on predictable, developer-reproducible infrastructure #6297

Closed