test_download_remote_layers_api flakyness #3831

Closed
koivunej opened this issue Mar 16, 2023 · 4 comments
Labels
a/test/flaky Area: related to flaky tests c/storage/pageserver Component: storage: pageserver

Comments

@koivunej

https://neon-github-public-dev.s3.amazonaws.com/reports/pr-3818/debug/4436920822/index.html#suites/b97efae3a617afb71cb8142f5afa5224/ea27292fb12d954/:

AssertionError: current_physical_size is sum of loaded layer sizes, independent of whether local or remote
assert 29835264 == 29417472
  +29835264
  -29417472

From the test, it looks like this could be caused by WAL being received from safekeepers after the size was measured. If not, then it's probably something mystical like #3209.

@koivunej koivunej added c/storage/pageserver Component: storage: pageserver a/test/flaky Area: related to flaky tests labels Mar 16, 2023
@arssher

arssher commented Jun 12, 2023

duplicate of #3422

@jcsp

jcsp commented Sep 12, 2023

A recent failure of this test: https://neon-github-public-dev.s3.amazonaws.com/reports/pr-5207/6157084684/index.html#suites/b97efae3a617afb71cb8142f5afa5224/9b22f1b52bf4c315

AssertionError: we redownloaded all the layers
assert 29835264.0 == 29876224.0

@koivunej

koivunej commented Sep 15, 2023

Hehe, it's an issue I have created. Today's analysis:

TL;DR: it definitely triggers the path I added in #5233.

An instance from today, which also has a successful run.

The test log does have the fixup I added:

2023-09-15 04:56:16.274 INFO [test_ondemand_download.py:402] fixing up filled_current_physical from 29835264 to 29876224 (40960)

but it seems that in this case the fixup does not work correctly:

AssertionError: we redownloaded all the layers
assert 29835264.0 == 29876224.0

I don't understand. Though I think that without my change in #5233 it would have failed on the previous assert near the "fixing up", so it might be that my "fixing up" is only partial.

EDIT: Log analysis later on: the logs clearly show an in-memory flush during shutdown, after the metrics were read.

koivunej added a commit that referenced this issue Sep 16, 2023
The test is still flaky, perhaps more after #5233, see #3831.

Do one more `timeline_checkpoint` *after* shutting down safekeepers
*before* shutting down pageserver. Put more effort into not compacting
or creating image layers.
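
For readers following along, the ordering problem described above can be sketched with a toy model. This is only an illustration under stated assumptions: `ToyPageserver`, `ToySafekeeper`, `flaky_order` and `fixed_order` are hypothetical names, not the Neon test fixtures, and the model pretends that WAL bytes map one-to-one onto layer bytes. The two numbers are taken from the "fixing up" log line quoted earlier.

```python
# Toy model only: hypothetical classes, not the real Neon test fixtures.
from dataclasses import dataclass


@dataclass
class ToyPageserver:
    on_disk_layer_bytes: int = 0
    in_memory_bytes: int = 0  # WAL received but not yet flushed to a layer

    def receive_wal(self, nbytes: int) -> None:
        self.in_memory_bytes += nbytes

    def timeline_checkpoint(self) -> None:
        # Flush the in-memory data so it counts towards the physical size.
        self.on_disk_layer_bytes += self.in_memory_bytes
        self.in_memory_bytes = 0

    def current_physical_size(self) -> int:
        return self.on_disk_layer_bytes

    def shutdown(self) -> None:
        # Assumption of this model: shutdown flushes whatever is in memory.
        self.timeline_checkpoint()


@dataclass
class ToySafekeeper:
    pending_wal: int
    running: bool = True

    def stop(self) -> None:
        self.running = False

    def stream_to(self, pageserver: ToyPageserver) -> None:
        # A stopped safekeeper can no longer deliver WAL.
        if self.running and self.pending_wal:
            pageserver.receive_wal(self.pending_wal)
            self.pending_wal = 0


def flaky_order():
    ps = ToyPageserver(on_disk_layer_bytes=29835264)
    sk = ToySafekeeper(pending_wal=40960)
    ps.timeline_checkpoint()
    measured = ps.current_physical_size()
    sk.stream_to(ps)  # late WAL arrives after the measurement
    ps.shutdown()     # and is flushed during shutdown
    return measured, ps.current_physical_size()


def fixed_order():
    ps = ToyPageserver(on_disk_layer_bytes=29835264)
    sk = ToySafekeeper(pending_wal=40960)
    ps.timeline_checkpoint()
    sk.stream_to(ps)          # late WAL may still arrive here
    sk.stop()                 # no more WAL after this point
    ps.timeline_checkpoint()  # one more checkpoint before measuring
    measured = ps.current_physical_size()
    sk.stream_to(ps)          # no-op: safekeeper is stopped
    ps.shutdown()
    return measured, ps.current_physical_size()


print(flaky_order())
print(fixed_order())
```

In this model `flaky_order()` measures a size that is 40960 bytes smaller than the size after shutdown, while `fixed_order()` measures the same value before and after, because nothing can reach the pageserver between the last checkpoint and the shutdown.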
@koivunej

Finally closing this due to my above log analysis and being able to merge #5322 as it fixes that exact problem.
