pagebench testcase is still flaky #8070

Bodobolero opened this issue Jun 17, 2024 · 0 comments
Labels
t/bug Issue Type: Bug


## Steps to reproduce

test_pageserver_max_throughput_getpage_at_latest_lsn is still flaky, see

Benchmarks failed on main: f670101

## Expected result

The benchmark runs successfully as long as the code under test is not broken.

## Actual result

Depending on the load on the GitHub Actions runners, it fails with:

```
RuntimeError: Run ['/tmp/neon/bin/neon_local', 'pageserver', 'start', '--id=1'] failed:
  stdout:
    Starting pageserver node 1 at 'localhost:15125' in "/tmp/test_output/test_pageserver_max_throughput_getpage_at_latest_lsn[release-pg14-github-actions-selfhosted-10-6-30]/repo/pageserver_1".....
    pageserver has not started yet, continuing to wait.....
    SIGKILL & wait the started process
  stderr:
    pageserver start failed: pageserver did not start+pass status checks within 10 seconds
```
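
The error above is a fixed startup deadline expiring while the process is still coming up. As a rough illustration only (this is not `neon_local`'s actual code, which is not shown in this issue; `wait_until_ready` and all of its parameters are hypothetical names), a status-check loop with a deadline might look like this:

```python
import time
from typing import Callable


def wait_until_ready(
    check: Callable[[], bool],
    timeout_s: float,
    interval_s: float = 0.1,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `check` until it returns True or `timeout_s` has elapsed.

    Hypothetical helper for illustration. Returns True if the check
    passed before the deadline, False if the deadline expired first.
    """
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            # The process may be healthy but slow; the caller cannot tell.
            return False
        sleep(interval_s)
```

With a loop of this shape, a process that needs longer than `timeout_s` to pass its status check is reported as failed even though nothing is broken, which would explain why the failure depends on runner load.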


## Environment


## Logs, links
- 
Bodobolero added the t/bug (Issue Type: Bug) label on Jun 17, 2024
Bodobolero self-assigned this on Jun 17, 2024
Bodobolero changed the title from "pagbench testcase is still flaky" to "pagebench testcase is still flaky" on Jun 17, 2024
Bodobolero added a commit that referenced this issue Jun 21, 2024
… controller in test cases to make test cases less flaky (#8079)

## Problem

see #8070

## Summary of changes

The neon_local subcommands that
- start neon
- start pageserver
- start safekeeper
- start storage controller

get a new option, -t=xx or --start-timeout=xx, which specifies a
longer timeout, in seconds, to wait for the process to start.
This is useful in test cases where the pageserver has to read a lot of
layer data, as in the pagebench test cases.

In addition, the Python test infrastructure (the Python fixtures) now
uses the new timeout option, and the flaky test case is modified to
increase the timeout from 10 seconds to 1 minute.

Example from the test execution

```bash
RUST_BACKTRACE=1 NEON_ENV_BUILDER_USE_OVERLAYFS_FOR_SNAPSHOTS=1 DEFAULT_PG_VERSION=15 BUILD_TYPE=release     ./scripts/pytest test_runner/performance/pageserver/pagebench/test_pageserver_max_throughput_getpage_at_latest_lsn.py
...
2024-06-19 09:29:34.590 INFO [neon_fixtures.py:1513] Running command "/instance_store/neon/target/release/neon_local storage_controller start --start-timeout=60s"
2024-06-19 09:29:36.365 INFO [broker.py:34] starting storage_broker to listen incoming connections at "127.0.0.1:15001"
2024-06-19 09:29:36.365 INFO [neon_fixtures.py:1513] Running command "/instance_store/neon/target/release/neon_local pageserver start --id=1 --start-timeout=60s"
2024-06-19 09:29:36.366 INFO [neon_fixtures.py:1513] Running command "/instance_store/neon/target/release/neon_local safekeeper start 1 --start-timeout=60s"
```
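
The command lines in the log above could be assembled by a small helper along these lines. This is a minimal sketch, not the actual fixture code (which is not shown in this issue): the function name `neon_local_start_cmd` and its parameters are hypothetical, and the argument shapes (`--id=1` for the pageserver, a positional id for the safekeeper, `--start-timeout=60s`) are copied from the logged commands rather than from `neon_local`'s documentation.

```python
from typing import List, Optional


def neon_local_start_cmd(
    neon_local_bin: str,
    component: str,
    timeout_s: Optional[int] = None,
    node_id: Optional[int] = None,
) -> List[str]:
    """Build the argv for `neon_local <component> start`.

    Hypothetical helper that mirrors the logged command lines.
    """
    cmd = [neon_local_bin, component, "start"]
    if node_id is not None:
        # In the log, the pageserver takes --id=N and the safekeeper a bare id.
        cmd.append(f"--id={node_id}" if component == "pageserver" else str(node_id))
    if timeout_s is not None:
        # Value format taken from the log, e.g. --start-timeout=60s.
        cmd.append(f"--start-timeout={timeout_s}s")
    return cmd
```

Leaving `timeout_s` as `None` omits the flag entirely, so the subcommand falls back to whatever default it uses (10 seconds, according to the error message in the issue).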
conradludgate pushed a commit that referenced this issue Jun 27, 2024
… controller in test cases to make test cases less flaky (#8079)
