Allow longer timeout for starting pageserver, safe keeper and storage controller in test cases to make test cases less flaky #8079
3341 tests run: 3215 passed, 0 failed, 126 skipped (full report)
Code coverage* (full report)
* collected from Rust tests only
The comment gets automatically updated with the latest test results
cfd8645 at 2024-06-21T10:44:23.313Z :recycle:
No objections, but I wonder if this is just papering over the real issue:
Why does having a lot of layer data make the pageserver slow to start up? There shouldn't be anything in the startup codepath that scales with the amount of data or # of layers, right?
We discussed this on Slack and agreed on increasing the timeout.
+1 for rust bits
33c8107 to 9f16000
(maybe off-topic, but why is autoscaling review requested?)
Good question, I did not explicitly request it. Maybe some code ownership rules of the affected files?
… controller in test cases to make test cases less flaky (#8079)

## Problem

see #8070

## Summary of changes

The neon_local subcommands to

- start neon
- start pageserver
- start safekeeper
- start storage controller

get a new option, -t=xx or --start-timeout=xx, which specifies a longer timeout, in seconds, to wait for the process to start.
This is useful in test cases where the pageserver has to read a lot of layer data, such as the pagebench test cases.
In addition, the Python test infrastructure (Python fixtures) now uses the new timeout option, and the flaky test case is modified to increase the timeout from 10 seconds to 1 minute.

Example from the test execution

```bash
RUST_BACKTRACE=1 NEON_ENV_BUILDER_USE_OVERLAYFS_FOR_SNAPSHOTS=1 DEFAULT_PG_VERSION=15 BUILD_TYPE=release ./scripts/pytest test_runner/performance/pageserver/pagebench/test_pageserver_max_throughput_getpage_at_latest_lsn.py
...
2024-06-19 09:29:34.590 INFO [neon_fixtures.py:1513] Running command "/instance_store/neon/target/release/neon_local storage_controller start --start-timeout=60s"
2024-06-19 09:29:36.365 INFO [broker.py:34] starting storage_broker to listen incoming connections at "127.0.0.1:15001"
2024-06-19 09:29:36.365 INFO [neon_fixtures.py:1513] Running command "/instance_store/neon/target/release/neon_local pageserver start --id=1 --start-timeout=60s"
2024-06-19 09:29:36.366 INFO [neon_fixtures.py:1513] Running command "/instance_store/neon/target/release/neon_local safekeeper start 1 --start-timeout=60s"
```
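For readers unfamiliar with the pattern, a configurable start timeout usually amounts to polling a readiness check until a deadline. The sketch below is illustrative only and is not taken from the neon code base: `start_timeout_arg` and `wait_until_started` are hypothetical helper names, the `is_up` probe is an assumption, and the 10-second default simply mirrors the previous timeout mentioned in the summary. The flag format follows the `--start-timeout=60s` form seen in the log excerpt.

```python
import time
from typing import Callable


def start_timeout_arg(seconds: int) -> str:
    """Format the flag as it appears in the log excerpt, e.g. --start-timeout=60s.

    Hypothetical helper; the real fixtures may build the argument differently.
    """
    return f"--start-timeout={seconds}s"


def wait_until_started(
    is_up: Callable[[], bool],
    timeout_s: float = 10.0,  # assumed default, mirroring the old 10-second limit
    poll_interval_s: float = 0.1,
) -> bool:
    """Poll is_up() until it returns True or timeout_s has elapsed.

    Returns True if the process came up in time, False otherwise.
    """
    # time.monotonic() is unaffected by wall-clock adjustments.
    deadline = time.monotonic() + timeout_s
    while True:
        if is_up():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval_s)
```

A test that has to load a lot of layer data would then pass a larger value, for example `wait_until_started(probe, timeout_s=60)`, rather than changing the default for every test.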