Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow longer timeout for starting pageserver, safe keeper and storage controller in test cases to make test cases less flaky #8079

Merged
merged 10 commits into from
Jun 21, 2024

Conversation

Bodobolero
Copy link
Contributor

@Bodobolero Bodobolero commented Jun 17, 2024

Problem

see #8070

Summary of changes

the neon_local subcommands to

  • start neon
  • start pageserver
  • start safekeeper
  • start storage controller

get a new option -t=xx or --start-timeout=xx which allows to specify a longer timeout in seconds we wait for the process start.
This is useful in test cases where the pageserver has to read a lot of layer data, like in pagebench test cases.

In addition we exploit the new timeout option in the python test infrastructure (python fixtures) and modify the flaky testcase to increase the timeout from 10 seconds to 1 minute.

Example from the test execution

RUST_BACKTRACE=1 NEON_ENV_BUILDER_USE_OVERLAYFS_FOR_SNAPSHOTS=1 DEFAULT_PG_VERSION=15 BUILD_TYPE=release     ./scripts/pytest test_runner/performance/pageserver/pagebench/test_pageserver_max_throughput_getpage_at_latest_lsn.py
...
2024-06-19 09:29:34.590 INFO [neon_fixtures.py:1513] Running command "/instance_store/neon/target/release/neon_local storage_controller start --start-timeout=60s"
2024-06-19 09:29:36.365 INFO [broker.py:34] starting storage_broker to listen incoming connections at "127.0.0.1:15001"
2024-06-19 09:29:36.365 INFO [neon_fixtures.py:1513] Running command "/instance_store/neon/target/release/neon_local pageserver start --id=1 --start-timeout=60s"
2024-06-19 09:29:36.366 INFO [neon_fixtures.py:1513] Running command "/instance_store/neon/target/release/neon_local safekeeper start 1 --start-timeout=60s"

@Bodobolero Bodobolero requested review from problame and bayandin June 17, 2024 15:35
logfile Outdated Show resolved Hide resolved
control_plane/src/bin/neon_local.rs Outdated Show resolved Hide resolved
Copy link

github-actions bot commented Jun 17, 2024

3341 tests run: 3215 passed, 0 failed, 126 skipped (full report)


Flaky tests (3)

Postgres 16

  • test_change_pageserver: debug
  • test_secondary_background_downloads: release

Postgres 14

  • test_subscriber_restart: debug

Code coverage* (full report)

  • functions: 32.3% (6842 of 21153 functions)
  • lines: 49.8% (53337 of 107157 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
cfd8645 at 2024-06-21T10:44:23.313Z :recycle:

@Bodobolero Bodobolero added the run-benchmarks Indicates to the CI that benchmarks should be run for PR marked with this label label Jun 17, 2024
@Bodobolero Bodobolero requested a review from bayandin June 17, 2024 17:01
@hlinnaka
Copy link
Contributor

No objections, but I wonder if this is just papering over the real issue:

This is useful in test cases where the pageserver has to read a lot of layer data, like in pagebench test cases.

Why does having a lot of layer data make the pageserver start up slow? There shouldn't be anything in the startup codepath that scales with the amount of data or # of layers, right?

control_plane/src/bin/neon_local.rs Outdated Show resolved Hide resolved
control_plane/src/bin/neon_local.rs Outdated Show resolved Hide resolved
control_plane/src/bin/neon_local.rs Outdated Show resolved Hide resolved
control_plane/src/background_process.rs Outdated Show resolved Hide resolved
control_plane/src/background_process.rs Outdated Show resolved Hide resolved
@Bodobolero
Copy link
Contributor Author

No objections, but I wonder if this is just papering over the real issue:

This is useful in test cases where the pageserver has to read a lot of layer data, like in pagebench test cases.

Why does having a lot of layer data make the pageserver start up slow? There shouldn't be anything in the startup codepath that scales with the amount of data or # of layers, right?

We discussed this on slack and agreed on increasing the timeout

@Bodobolero Bodobolero requested a review from problame June 19, 2024 09:32
Copy link
Contributor

@problame problame left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for rust bits

control_plane/src/bin/neon_local.rs Outdated Show resolved Hide resolved
control_plane/src/background_process.rs Outdated Show resolved Hide resolved
@Bodobolero Bodobolero enabled auto-merge (squash) June 19, 2024 11:00
@Bodobolero Bodobolero requested review from a team as code owners June 19, 2024 12:15
@Bodobolero Bodobolero force-pushed the bodobolero/variable_timeout_neonlocal branch from 33c8107 to 9f16000 Compare June 19, 2024 12:22
@sharnoff
Copy link
Member

(maybe off-topic, but why is autoscaling review requested?)

@Bodobolero
Copy link
Contributor Author

(maybe off-topic, but why is autoscaling review requested?)

good question, I did not explicitly request it, maybe some code ownership rules of affected files?

@Bodobolero Bodobolero merged commit 82266a2 into main Jun 21, 2024
66 of 68 checks passed
@Bodobolero Bodobolero deleted the bodobolero/variable_timeout_neonlocal branch June 21, 2024 10:36
conradludgate pushed a commit that referenced this pull request Jun 27, 2024
… controller in test cases to make test cases less flaky (#8079)

## Problem

see #8070

## Summary of changes

the neon_local subcommands to 
- start neon
- start pageserver
- start safekeeper
- start storage controller

get a new option -t=xx or --start-timeout=xx which allows to specify a
longer timeout in seconds we wait for the process start.
This is useful in test cases where the pageserver has to read a lot of
layer data, like in pagebench test cases.

In addition we exploit the new timeout option in the python test
infrastructure (python fixtures) and modify the flaky testcase to
increase the timeout from 10 seconds to 1 minute.

Example from the test execution

```bash
RUST_BACKTRACE=1 NEON_ENV_BUILDER_USE_OVERLAYFS_FOR_SNAPSHOTS=1 DEFAULT_PG_VERSION=15 BUILD_TYPE=release     ./scripts/pytest test_runner/performance/pageserver/pagebench/test_pageserver_max_throughput_getpage_at_latest_lsn.py
...
2024-06-19 09:29:34.590 INFO [neon_fixtures.py:1513] Running command "/instance_store/neon/target/release/neon_local storage_controller start --start-timeout=60s"
2024-06-19 09:29:36.365 INFO [broker.py:34] starting storage_broker to listen incoming connections at "127.0.0.1:15001"
2024-06-19 09:29:36.365 INFO [neon_fixtures.py:1513] Running command "/instance_store/neon/target/release/neon_local pageserver start --id=1 --start-timeout=60s"
2024-06-19 09:29:36.366 INFO [neon_fixtures.py:1513] Running command "/instance_store/neon/target/release/neon_local safekeeper start 1 --start-timeout=60s"
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
run-benchmarks Indicates to the CI that benchmarks should be run for PR marked with this label
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants