Instability in test_pull_timeline
#9731
I've gone over the seven runs where it failed in the last 14 days, and they all failed on the same reconfiguration request. Log excerpt:
All 7 reproductions were on Postgres 14 or 15, and all on release builds.
Looking at the compute logs, it seems that the configure operation takes 30 seconds:
Looking at four other reproductions of the seven, it's always 30 seconds. Maybe it's hitting some timeout? When it works, the operation finishes much more quickly, within a second, as here:
The compute logs for the buggy case show this around the "gap" of 30 seconds. Apparently it tries to upload traces:
This OpenTelemetry trace error could either be a coincidence or actually be caused by us. From the logs, it seems to hang in the second
Hmm, since today, Nov 20, 2024 11:05 UTC, it seems to be way more flaky. If I look at the main branch commits, I see a safekeeper-related PR that was merged at 11:07 UTC: #9364. But the occurrence from 11:05 (a8c1f67) didn't include it, nor did the second one at 11:08 (e5024a5). Both include #9717, however.
I see 23 flaky failures out of 364 total runs of the test, but maybe only a third of those runs are actually susceptible to the issue, which would put the effective failure rate at roughly 20%. So one can probably reproduce it easily now.
Hmm, yeah, I really think that #9717 is the culprit, at least for the new issue. It touched the code that is now printing:
You can see the 30 second gap is somewhere in the
Before, `OpenTelemetry` errors were printed to stdout/stderr directly, causing one of the few log lines without a timestamp, like:

```
OpenTelemetry trace error occurred. error sending request for url (http://localhost:4318/v1/traces)
```

Now, we print:

```
2024-11-21T02:24:20.511160Z INFO OpenTelemetry error: error sending request for url (http://localhost:4318/v1/traces)
```

I found this while investigating #9731.
It's indeed not hard to make it reproduce after the increase in occurrences. I changed the test to run 16 copies, and if I run all 16 in parallel, there is usually one or two that fail.

```diff
--- a/test_runner/regress/test_wal_acceptor.py
+++ b/test_runner/regress/test_wal_acceptor.py
@@ -1870,8 +1870,10 @@ def test_delete_timeline_under_load(neon_env_builder: NeonEnvBuilder):
 # Basic pull_timeline test.
 # When live_sk_change is False, compute is restarted to change set of
 # safekeepers; otherwise it is live reload.
-@pytest.mark.parametrize("live_sk_change", [False, True])
-def test_pull_timeline(neon_env_builder: NeonEnvBuilder, live_sk_change: bool):
+@pytest.mark.parametrize("num", range(16))
+def test_pull_timeline(neon_env_builder: NeonEnvBuilder, num: int):
+    live_sk_change = True
+    _num = num
     neon_env_builder.auth_enabled = True
```
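(Something like `pytest -n 16 -k test_pull_timeline test_runner/regress/test_wal_acceptor.py` runs the 16 copies concurrently, assuming `pytest-xdist` is installed to provide the `-n` worker flag.)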
I don't think this issue is in my PR, but in the walproposer reconfiguration system:
Because no commit can get acknowledged without first streaming its WAL to the safekeepers, and those safekeepers not being connected (see the

So, the one question I have here is "why did it take so long for the WP bgworker to reboot and get connected again?", and not per se "why is the new code much more sensitive to this change?". Note that @arpad-m also noticed it with high SK restart rates, which to me implies issues on the side of SK handling, not my PR per se.
@tristan957 are you working on this this week? Let us know if we should reassign it.
Upon further reflection: I have already dedicated around 5-6 hours to this issue, and I don't think I'm going to make much more progress. I'm not saying that the core reason isn't a storage problem. It might turn out to be one in the end, but even if that's the case, it would be useful to know which exact expectation the compute has on the storage is being violated. Then I can address it from the storage side. We seem to stall in an SQL query. In general, when we live-reconfigure safekeepers, it seems strange to me that we do any WAL-logged changes, like SQL writes, but idk.
Arseny to look into this once #9915 is reviewed, to give a general sense of whether this area of code looks problematic |
Tried to reproduce locally without success so far; looks like we need to debug it on CI, unfortunately.
On latest main, 6ad9982, I can't reproduce it either. Checking out a commit from Nov 20 (2d6bf17), the day I did my original reproduction, I do get a reproduction with a command like the following (most of it is just to compile):
(I added the changes pasted in my original reproduction to run 16 copies of the test in parallel.)
I can reproduce it locally only when running 32, or better 64, test instances in parallel (e.g. using the snippet above), causing CPU starvation. Run sequentially, reconfigure always takes about 1s as expected without #9915, and the whole test takes about 6s. In CI the test fails in roughly 10% of runs. I don't see any particular misbehavior, so I'm inclined to just increase the timeout; probably the CI runners are overloaded with the current parallelism of 12, and 30s is not enough. https://neondb.slack.com/archives/C059ZC138NR/p1733920416393659
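For reference, the failing wait is essentially a poll-until-deadline pattern. The sketch below is illustrative only: the `is_reconfigured()` check is hypothetical and the 30s default stands in for whatever the test actually polls, but it shows why a starved runner blows the budget even when nothing is misbehaving.

```python
import time

# Rough sketch of a poll-until-timeout helper, assuming a hypothetical
# is_reconfigured() check; the real test helpers may differ.
def wait_until(check, timeout_s: float = 30.0, interval_s: float = 0.5) -> None:
    deadline = time.monotonic() + timeout_s
    while not check():
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout_s}s")
        time.sleep(interval_s)

# Example: raising timeout_s above 30s tolerates slow, overloaded CI runners.
# wait_until(lambda: is_reconfigured(endpoint), timeout_s=60.0)
```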
Seems like 30s is sometimes not enough when CI runners are overloaded, causing pull_timeline flakiness (#10088). ref #9731 (comment)
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-9646/11788052818/index.html#testresult/5a4b2cac9be55bb0/