Instability in test_pull_timeline #9731

Open
jcsp opened this issue Nov 12, 2024 · 14 comments

jcsp commented Nov 12, 2024

https://neon-github-public-dev.s3.amazonaws.com/reports/pr-9646/11788052818/index.html#testresult/5a4b2cac9be55bb0/

command failed: error sending request for url (http://127.0.0.1:17120/configure)

arpad-m commented Nov 20, 2024

I went over the seven runs where it failed in the last 14 days, and they all failed on the same reconfiguration request. Log excerpt:

2024-11-08 20:03:27.644 INFO [test_wal_acceptor.py:1848] reconfiguing compute with new config to verify that it works
2024-11-08 20:03:27.644 INFO [neon_cli.py:72] Running command "/tmp/neon/bin/neon_local endpoint reconfigure ep-1 --tenant-id 0b0b1497dff3937d1002c5fbbb58e6ab --safekeepers 1,3,4"
2024-11-08 20:03:57.793 INFO [neon_cli.py:137] Run ['/tmp/neon/bin/neon_local', 'endpoint', 'reconfigure', 'ep-1', '--tenant-id', '0b0b1497dff3937d1002c5fbbb58e6ab', '--safekeepers', '1,3,4'] failed:
  stdout:
    Getting claims for path debug/v1/tenant/0b0b1497dff3937d1002c5fbbb58e6ab/locate
    Got claims Claims { tenant_id: None, scope: Admin } for path debug/v1/tenant/0b0b1497dff3937d1002c5fbbb58e6ab/locate
  stderr:
    command failed: error sending request for url (http://127.0.0.1:17035/configure)

    Caused by:
        operation timed out

All 7 reproductions were on Postgres 14 or 15, and all on release builds.


arpad-m commented Nov 20, 2024

Looking at the compute logs, it seems that the configure operation takes 30 seconds:

2024-11-08T20:03:27.660149Z  INFO http request{otel.name=/configure http.method=POST}: serving /configure POST request
...
2024-11-08T20:03:57.780342Z  INFO configurator_main_loop: compute node configured

Looking at four other of the seven reproductions, it's always 30 seconds. Maybe it's running into some timeout?

When it works, the operation finishes much more quickly, within about a second, for example here:

2024-11-08T20:04:08.061696Z  INFO http request{otel.name=/configure http.method=POST}: serving /configure POST request
...
2024-11-08T20:04:09.172912Z  INFO configurator_main_loop: compute node configured

The compute logs for the buggy case show the following around the 30-second gap. Apparently it tries to upload traces:

PG:2024-11-08 20:03:27.669 GMT [351693] LOG:  [NEON_SMGR] [shard 0] libpagestore: connected to 'postgresql://no_user@localhost:17020' with protocol version 2
OpenTelemetry trace error occurred. error sending request for url (http://localhost:4318/v1/traces)
PG:2024-11-08 20:03:57.693 GMT [367892] LOG:  [WP] connecting with node 127.0.0.1:17023

This OpenTelemetry trace error could either be a coincidence or actually be caused by us.

From the logs, it seems to hang in the second `self.pg_reload_conf()?;` invocation in `ComputeNode::reconfigure`. This function invokes the `pg_ctl` binary and then waits for it to terminate. We currently discard that binary's stdout/stderr instead of printing it, which might be useful here.
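
The code in question is Rust, but a minimal Python sketch of the idea (capture and log the child's output, and bound how long we wait for it) could look like this; the binary path and data directory arguments are placeholders:

```python
# Illustrative sketch only: the real code is Rust, and the paths here are
# placeholders. The point is to capture pg_ctl's stdout/stderr (and bound its
# runtime) instead of discarding them, so a hang like the one above leaves
# some evidence in the logs.
import subprocess

def pg_reload_conf(pg_bin_dir: str, pgdata: str, timeout_s: int = 60) -> None:
    proc = subprocess.run(
        [f"{pg_bin_dir}/pg_ctl", "reload", "-D", pgdata],
        capture_output=True,
        text=True,
        timeout=timeout_s,  # fail loudly rather than blocking the configurator loop
    )
    # Log whatever pg_ctl printed, even on success; it is cheap and helps debugging.
    if proc.stdout or proc.stderr:
        print(f"pg_ctl reload output: stdout={proc.stdout!r} stderr={proc.stderr!r}")
    if proc.returncode != 0:
        raise RuntimeError(f"pg_ctl reload failed with code {proc.returncode}")
```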


arpad-m commented Nov 20, 2024

Hmm, since today, Nov 20, 2024, 11:05 UTC, it seems to be way more flaky:

[image]

If I look at the main branch commits, I see that a safekeeper-related PR was merged at 11:07 UTC: #9364.

But the occurrence from 11:05 (a8c1f67) didn't include it, nor did the second one at 11:08 (e5024a5). Both include #9717, however.


arpad-m commented Nov 21, 2024

I see 23 flaky failures out of 364 total runs of the test, but maybe only a third of the test variants are actually susceptible to the issue, so roughly one in five susceptible runs fails. One can probably reproduce it easily now.


arpad-m commented Nov 21, 2024

Hmm, yeah, I really think that #9717 is the culprit, at least for the new flakiness. It touched the code that is now printing:

2024-11-21T02:24:17.018130Z  INFO configurator_main_loop:reconfigure:apply_spec_sql: Applying phase CreateSuperUser
2024-11-21T02:24:17.020654Z  INFO configurator_main_loop:reconfigure:apply_spec_sql: Starting phase CreateSuperUser
2024-11-21T02:24:17.020665Z  INFO configurator_main_loop:reconfigure:apply_spec_sql:db_apply_changes{phase=CreateSuperUser}: Processing phase CreateSuperUser
2024-11-21T02:24:17.020675Z  INFO configurator_main_loop:reconfigure:apply_spec_sql:db_apply_changes{phase=CreateSuperUser}: Applying phase CreateSuperUser
PG:2024-11-21 02:24:17.007 GMT [4086451] LOG:  [NEON_SMGR] [shard 0] libpagestore: connected to 'postgresql://no_user@localhost:29434' with protocol version 2
PG:2024-11-21 02:24:17.029 GMT [4086451] LOG:  ConsoleURL not set, skipping forwarding
PG:2024-11-21 02:24:17.029 GMT [4086451] STATEMENT:  DO $$ DECLARE r text; roles text[] := NULL; dbs text[] := NULL; BEGIN IF NOT EXISTS ( SELECT FROM pg_catalog.pg_roles WHERE rolna>
2024-11-21T02:24:20.511160Z  INFO OpenTelemetry error: error sending request for url (http://localhost:4318/v1/traces)
PG:2024-11-21 02:24:47.034 GMT [4092035] LOG:  [WP] connecting with node 127.0.0.1:29437
PG:2024-11-21 02:24:47.034 GMT [4092035] LOG:  [WP] connecting with node 127.0.0.1:29443
PG:2024-11-21 02:24:47.034 GMT [4092035] LOG:  [WP] connecting with node 127.0.0.1:29446
PG:2024-11-21 02:24:47.034 GMT [4092035] LOG:  [WP] connected with node 127.0.0.1:29437
PG:2024-11-21 02:24:47.034 GMT [4092035] LOG:  [WP] connected with node 127.0.0.1:29443
PG:2024-11-21 02:24:47.034 GMT [4092035] LOG:  [WP] connected with node 127.0.0.1:29446
PG:2024-11-21 02:24:47.034 GMT [4092036] LOG:  [NEON_SMGR] [shard 0] libpagestore: connected to 'postgresql://no_user@localhost:29434' with protocol version 2
PG:2024-11-21 02:24:47.035 GMT [4092035] LOG:  [WP] received AcceptorGreeting from safekeeper 127.0.0.1:29443, term=2
PG:2024-11-21 02:24:47.035 GMT [4092035] LOG:  [WP] received AcceptorGreeting from safekeeper 127.0.0.1:29437, term=2
PG:2024-11-21 02:24:47.035 GMT [4092035] LOG:  [WP] proposer connected to quorum (2) safekeepers, propTerm=3
PG:2024-11-21 02:24:47.035 GMT [4092035] LOG:  [WP] requesting vote from 127.0.0.1:29437 for term 3
PG:2024-11-21 02:24:47.035 GMT [4092035] LOG:  [WP] requesting vote from 127.0.0.1:29443 for term 3
PG:2024-11-21 02:24:47.035 GMT [4092035] LOG:  [WP] received AcceptorGreeting from safekeeper 127.0.0.1:29446, term=2
PG:2024-11-21 02:24:47.035 GMT [4092035] LOG:  [WP] requesting vote from 127.0.0.1:29446 for term 3
PG:2024-11-21 02:24:47.035 GMT [4092035] LOG:  [WP] got VoteResponse from acceptor 127.0.0.1:29437, voteGiven=1, epoch=2, flushLsn=0/229C6A8, truncateLsn=0/1BB9638, timelineStartLsn=0/149EF10
PG:2024-11-21 02:24:47.035 GMT [4092035] LOG:  [WP] got VoteResponse from acceptor 127.0.0.1:29443, voteGiven=1, epoch=2, flushLsn=0/229C6A8, truncateLsn=0/1BB9638, timelineStartLsn=0/149EF10
PG:2024-11-21 02:24:47.035 GMT [4092035] LOG:  [WP] got votes from majority (2) of nodes, term 3, epochStartLsn 0/229C6A8, donor 127.0.0.1:29437, truncate_lsn 0/1BB9638
PG:2024-11-21 02:24:47.035 GMT [4092035] LOG:  [WP] sending elected msg to node 1 term=3, startStreamingAt=0/229C6A8 (lastCommonTerm=2), termHistory.n_entries=2 to 127.0.0.1:29437, timelineStartLsn=0/149EF10
PG:2024-11-21 02:24:47.036 GMT [4092035] LOG:  [WP] sending elected msg to node 3 term=3, startStreamingAt=0/229C6A8 (lastCommonTerm=2), termHistory.n_entries=2 to 127.0.0.1:29443, timelineStartLsn=0/149EF10
PG:2024-11-21 02:24:47.036 GMT [4092035] LOG:  [WP] WAL proposer starts streaming at 0/229C6A8
PG:2024-11-21 02:24:47.036 GMT [4092035] LOG:  [WP] got VoteResponse from acceptor 127.0.0.1:29446, voteGiven=1, epoch=2, flushLsn=0/229C6A8, truncateLsn=0/1BB9638, timelineStartLsn=0/149EF10
PG:2024-11-21 02:24:47.036 GMT [4092035] LOG:  [WP] sending elected msg to node 4 term=3, startStreamingAt=0/229C6A8 (lastCommonTerm=2), termHistory.n_entries=2 to 127.0.0.1:29446, timelineStartLsn=0/149EF10
2024-11-21T02:24:47.036488Z  INFO configurator_main_loop:reconfigure:apply_spec_sql:db_apply_changes{phase=CreateSuperUser}: successfully executed

You can see that the 30-second gap is somewhere in the CreateSuperUser phase the PR added. Note that the old code was also applying superuser privileges.

github-merge-queue bot pushed a commit that referenced this issue Nov 21, 2024
Before, `OpenTelemetry` errors were printed to stdout/stderr directly,
causing one of the few log lines without a timestamp, like:

```
OpenTelemetry trace error occurred. error sending request for url (http://localhost:4318/v1/traces)
```

Now, we print:

```
2024-11-21T02:24:20.511160Z  INFO OpenTelemetry error: error sending request for url (http://localhost:4318/v1/traces)
```

I found this while investigating #9731.

arpad-m commented Nov 21, 2024

It's indeed not hard to reproduce it after the increase in occurrences. I changed the test to run 16 copies, and if I run all 16 in parallel, usually one or two of them fail.

--- a/test_runner/regress/test_wal_acceptor.py
+++ b/test_runner/regress/test_wal_acceptor.py
@@ -1870,8 +1870,10 @@ def test_delete_timeline_under_load(neon_env_builder: NeonEnvBuilder):
 # Basic pull_timeline test.
 # When live_sk_change is False, compute is restarted to change set of
 # safekeepers; otherwise it is live reload.
-@pytest.mark.parametrize("live_sk_change", [False, True])
-def test_pull_timeline(neon_env_builder: NeonEnvBuilder, live_sk_change: bool):
+@pytest.mark.parametrize("num", range(16))
+def test_pull_timeline(neon_env_builder: NeonEnvBuilder, num: int):
+    live_sk_change = True
+    _num = num
     neon_env_builder.auth_enabled = True

tristan957 assigned tristan957 and unassigned arpad-m Nov 21, 2024

MMeent commented Nov 21, 2024

I don't think this issue is in my PR, but in the walproposer reconfiguration system:

PG:2024-11-12 00:32:50.089 GMT [361208] FATAL:  [WP] restarting walproposer to change safekeeper list from 127.0.0.1:17108,127.0.0.1:17111,127.0.0.1:17114 to 127.0.0.1:17108,127.0.0.1:17114,127.0.0.1:17117
    ^ Here, the WalProposer terminates to reconfigure the safekeepers from [set A: `:17108`, `:17111`, `:17114`] to [set B: `:17108`, `:17114`, `:17117`]. Note the replacement of 17111 with 17117.
PG:2024-11-12 00:32:50.091 GMT [361191] LOG:  background worker "WAL proposer" (PID 361208) exited with exit code 1
    ^ Here, the postmaster realises the WP bgworker has indeed exited.
2024-11-12T00:32:50.120203Z  INFO configurator_main_loop:reconfigure:handle_roles: postgres roles (total 12): ["cloud_admin:(null)", "pg_database_owner:(null)", "pg_read_all_data:(null)", "pg_write_all_data:(null)", "pg_monitor:(null)", "pg_read_all_settings:(null)", "pg_read_all_stats:(null)", "pg_stat_scan_tables:(null)", "pg_read_server_files:(null)", "pg_write_server_files:(null)", "pg_execute_server_program:(null)", "pg_signal_backend:(null)"]
2024-11-12T00:32:50.144262Z  INFO configurator_main_loop:reconfigure:handle_roles: handling cluster spec roles (total 0)
2024-11-12T00:32:50.145019Z  INFO configurator_main_loop:reconfigure:handle_databases: postgres databases (total 3): ["template1:cloud_admin", "template0:cloud_admin", "postgres:cloud_admin"]
2024-11-12T00:32:50.145523Z  INFO configurator_main_loop:reconfigure:handle_databases: handling cluster spec databases (total 0)
2024-11-12T00:32:50.145570Z  INFO configurator_main_loop:reconfigure:handle_grants: modifying database permissions
2024-11-12T00:32:50.146073Z  INFO configurator_main_loop:reconfigure:handle_extension_neon: handle extension neon
PG:2024-11-12 00:32:50.094 GMT [362843] LOG:  [NEON_SMGR] [shard 0] libpagestore: connected to 'postgresql://no_user@localhost:17105' with protocol version 2
OpenTelemetry trace error occurred. error sending request for url (http://localhost:4318/v1/traces)
PG:2024-11-12 00:33:20.183 GMT [379533] LOG:  [WP] connecting with node 127.0.0.1:17108
    ^ Here, 30 seconds later, the connection to the new set of safekeepers is first started.
PG:2024-11-12 00:33:20.183 GMT [379533] LOG:  [WP] connecting with node 127.0.0.1:17114
PG:2024-11-12 00:33:20.183 GMT [379533] LOG:  [WP] connecting with node 127.0.0.1:17117
PG:2024-11-12 00:33:20.184 GMT [379533] LOG:  [WP] connected with node 127.0.0.1:17108

Because no commit can be acknowledged without first streaming its WAL to the safekeepers, and those safekeepers are not yet connected (see the FATAL: [WP] line), no progress is made with starting the endpoint until we have a quorum of safekeepers.
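
A minimal sketch of the quorum arithmetic behind that (node addresses taken from the log excerpt above; the helper itself is hypothetical):

```python
# With 3 safekeepers a quorum is 2: until at least 2 members of the *new* set
# are connected and have voted, the walproposer cannot commit any WAL and the
# endpoint makes no visible progress.
def quorum(n_safekeepers: int) -> int:
    return n_safekeepers // 2 + 1

new_set = ["127.0.0.1:17108", "127.0.0.1:17114", "127.0.0.1:17117"]
connected = ["127.0.0.1:17108"]  # e.g. only one member reachable so far

print(quorum(len(new_set)))                    # 2
print(len(connected) >= quorum(len(new_set)))  # False -> commits stall
```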

So, the one question I have here is "why did it take so long for the WP bgworker to reboot and get connected again?", and not per se "why is the new code much more sensitive to this change?".

Note that @arpad-m also notices it with high SK restart rates, which to me implies issues on the side of SK handling, not my PR per se.


jcsp commented Nov 25, 2024

@tristan957 are you working on this this week? Let us know if we should reassign it.

@tristan957

I'm working today and the first half of tomorrow. My understanding after talking with both @arpad-m and @MMeent is that Arpad is going to continue investigating from the storage side. Arpad, please let me know if that is wrong.


arpad-m commented Nov 25, 2024

Upon further reflection, I have already dedicated around 5-6 hours to this issue, and I don't think I'm going to make much progress.
In my opinion this issue falls into the compute team's camp. We call an endpoint on the compute and it doesn't return within the timeout, so we error.

I'm not saying that the root cause isn't a storage problem. It might turn out to be, but even if that's the case, it would be useful to know exactly which of the compute's expectations on the storage is being violated. Then I can address it from the storage side.

We seem to stall in an SQL query. In general, it seems strange to me that we make any WAL-logged changes, like SQL writes, when we live-reconfigure safekeepers, but I don't know.


jcsp commented Dec 3, 2024

Arseny to look into this once #9915 is reviewed, to give a general sense of whether this area of code looks problematic


arssher commented Dec 10, 2024

Tried to reproduce locally without success so far; it looks like we need to debug it on CI, unfortunately.


arpad-m commented Dec 10, 2024

On latest main (6ad9982), I can't reproduce it either. Checking out a commit from Nov 20 (2d6bf17), the day I did my original reproduction, I do get a reproduction with a command like the following (most of it is just for compiling):

git submodule update --init; make -j 32 -s
cargo build_testing --release  && RUST_BACKTRACE=1 BUILD_TYPE=release DEFAULT_PG_VERSION=15 ./scripts/pytest --preserve-database-files -vv --tb=native -n 16 -k test_pull_timeline

(I added the changes pasted in my original reproduction to run 16 copies of the test in parallel)


arssher commented Dec 11, 2024

I can reproduce it locally only when running 32, or better 64, test instances in parallel (e.g. using the snippet above), causing CPU starvation. Run sequentially, reconfigure always takes about 1s as expected without #9915, and the whole test takes about 6s. In CI the test fails in roughly 10% of runs. I don't see any particular misbehavior, so I'm inclined to just increase the timeout: the CI runners are probably overloaded with the current parallelism of 12, and 30s is not enough.

https://neondb.slack.com/archives/C059ZC138NR/p1733920416393659
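
The eventual fix lives in the Rust tooling, but the shape of the change (allow the /configure request to take longer than 30s) can be sketched in Python; the URL is taken from the failing run's logs above and the request body is a placeholder:

```python
# Sketch only, not the actual neon_local client code. The URL comes from the
# failing run's logs; the payload shape is a placeholder.
import requests

COMPUTE_CONFIGURE_URL = "http://127.0.0.1:17035/configure"

def reconfigure_compute(spec: dict, timeout_s: float = 120.0) -> None:
    # The 30s client-side timeout was the one being hit on overloaded CI runners;
    # a longer timeout gives the compute headroom to finish applying the spec.
    resp = requests.post(COMPUTE_CONFIGURE_URL, json=spec, timeout=timeout_s)
    resp.raise_for_status()
```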

arssher added a commit that referenced this issue Dec 11, 2024
Seems like 30s sometimes not enough when CI runners are overloaded,
causing pull_timeline flakiness.

ref #9731 (comment)
github-merge-queue bot pushed a commit that referenced this issue Dec 11, 2024
…10088)

Seems like 30s sometimes not enough when CI runners are overloaded,
causing pull_timeline flakiness.

ref
#9731 (comment)