Fix flaky pageserver restarts in tests #2261

Merged: 19 commits from fixture-restart into main on Aug 17, 2022
Conversation

@bojanserafimov (Contributor)

Resolves #2247

This reverts commit 21089d5.
@bojanserafimov (Contributor, Author)

Not sure why my "wait for pid" solution was failing consistently on the same cases. It passes locally. Since we don't run CI on branches anymore, I'll make a few commits to this PR to test what works in CI and ask for re-review.
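
For reference, the "wait for pid" approach amounts to polling until the killed process has actually disappeared. A minimal Rust sketch of that pattern, using the libc crate; the names and timings are illustrative, not the actual fixture code:

use std::thread;
use std::time::{Duration, Instant};

/// Poll until `pid` no longer exists, or give up after `timeout`.
/// kill(pid, 0) delivers no signal; it only checks whether the process
/// is still visible. Note that an unreaped zombie still counts as present.
fn wait_for_death(pid: libc::pid_t, timeout: Duration) -> bool {
    let deadline = Instant::now() + timeout;
    while Instant::now() < deadline {
        if unsafe { libc::kill(pid, 0) } != 0 {
            return true; // process (including any zombie entry) is gone
        }
        thread::sleep(Duration::from_millis(100));
    }
    false // still present after the timeout
}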

@hlinnaka (Contributor)

We have the same pattern to wait for process death at safekeeper shutdown, too. I presume it also needs to be fixed.

@bojanserafimov (Contributor, Author) commented Aug 12, 2022

> We have the same pattern to wait for process death at safekeeper shutdown, too. I presume it also needs to be fixed.

Sure. But first I need to see why it's consistently timing out. 100 seconds should be enough, but half the tests fail (on CI only).

@bojanserafimov linked an issue on Aug 15, 2022 that may be closed by this pull request
@bojanserafimov (Contributor, Author)

The issue: Now that I'm properly waiting for the pageserver to die, it refuses to die, even 100 seconds after SIGKILL. This is strange, because:

  1. It only happens on CI.
  2. Somehow the pageserver dies correctly 99% of the time when we're not waiting for it to die. How?

My theory: In handle_pagerequests we rely on pgb.read_message() (which is just TcpStream::read() in a trench coat) to occasionally time out and return a WouldBlock error instead of blocking. But the socket is configured with set_nonblocking(false), so it's up to the OS/hardware to decide whether to block or not. The OS/hardware could be different on my laptop, on the old runners, and on the new runners.

If the OS/hardware blocks, then the pageserver would be stuck in kernel space waiting on some network syscall. Since SIGKILL basically means "next time you're in user mode, please die", the pageserver will hang forever. Unless we continue to the next test without waiting for it, and start a new NeonEnv, which starts a compute node, which sends some request and unblocks the old pageserver, allowing it to die just in time for the new one to start. Sometimes the timing is not right and we get a timeout trying to lock the pid file. That would explain why not waiting for the pageserver seems to work better.
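
A minimal sketch of the distinction this theory hinges on (illustrative only, not the actual pageserver code): a blocking socket with no read timeout can park the reading thread in the kernel indefinitely, while an explicit read timeout forces read() to return periodically so the connection loop can notice shutdown.

use std::net::TcpStream;
use std::time::Duration;

fn configure_socket(stream: &TcpStream) -> std::io::Result<()> {
    // Blocking mode: read() may sit in the kernel until data arrives.
    stream.set_nonblocking(false)?;
    // With an explicit read timeout, read() instead fails with
    // ErrorKind::WouldBlock / ErrorKind::TimedOut after ~1s of silence,
    // giving the caller a chance to check for shutdown and exit cleanly.
    stream.set_read_timeout(Some(Duration::from_secs(1)))?;
    Ok(())
}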

@LizardWizzard It's an elaborate theory, might be wrong, but I'm out of simple ones. Trying to find a way to check if any of this is in the right direction.

@zoete is there an easy way to test this PR on the old runners, just to see if the issue is OS/platform behavior? Another theory suggested by @LizardWizzard is that we could be hitting a bug like this, so that's worth checking too

@knizhnik (Contributor)

A blocking system call on a slow device (i.e. a socket) should be interrupted by a signal: the system call should return an EINTR error in this case.
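
In Rust, that EINTR shows up as ErrorKind::Interrupted, which callers are expected to retry. A hedged sketch of that retry loop (not the pageserver's actual read path):

use std::io::{ErrorKind, Read};
use std::net::TcpStream;

// Retry a read whose underlying syscall was interrupted by a signal (EINTR).
fn read_retrying(stream: &mut TcpStream, buf: &mut [u8]) -> std::io::Result<usize> {
    loop {
        match stream.read(buf) {
            Err(e) if e.kind() == ErrorKind::Interrupted => continue, // EINTR: try again
            other => return other,
        }
    }
}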

@LizardWizzard (Contributor)

This is the output from the runner machine:

sh-4.2$ ps aux | grep pageserver
ec2-user   511  0.0  0.0      0     0 ?        Zs   10:56   0:00 [pageserver] <defunct>
ec2-user   513  0.5  0.0      0     0 ?        Z    10:56   0:02 [pageserver] <defunct>
ec2-user   947  0.0  0.0      0     0 ?        Zs   10:57   0:00 [pageserver] <defunct>
ec2-user   948 26.9  0.0      0     0 ?        Z    10:57   2:12 [pageserver] <defunct>
ec2-user  2511  0.0  0.0      0     0 ?        Zs   10:58   0:00 [pageserver] <defunct>
ec2-user  2512  0.0  0.0      0     0 ?        Z    10:58   0:00 [pageserver] <defunct>
ec2-user  2656  0.0  0.0      0     0 ?        Zs   10:59   0:00 [pageserver] <defunct>
ec2-user  2657  0.2  0.0      0     0 ?        Z    10:59   0:00 [pageserver] <defunct>
ec2-user  2776  0.0  0.0      0     0 ?        Zs   11:00   0:00 [pageserver] <defunct>
ec2-user  2777  0.2  0.0      0     0 ?        Z    11:00   0:00 [pageserver] <defunct>
ec2-user  2970  0.0  0.0      0     0 ?        Zs   11:00   0:00 [pageserver] <defunct>
ec2-user  2971  0.0  0.0      0     0 ?        Z    11:00   0:00 [pageserver] <defunct>
ec2-user  3096  0.0  0.0      0     0 ?        Zs   11:00   0:00 [pageserver] <defunct>
ec2-user  3097  1.6  0.0      0     0 ?        Z    11:00   0:04 [pageserver] <defunct>
ec2-user  3280  0.0  0.0      0     0 ?        Zs   11:01   0:00 [pageserver] <defunct>
ec2-user  3281  0.1  0.0      0     0 ?        Z    11:01   0:00 [pageserver] <defunct>
ec2-user  3409  0.0  0.0      0     0 ?        Zs   11:02   0:00 [pageserver] <defunct>
ec2-user  3410  0.1  0.0      0     0 ?        Z    11:02   0:00 [pageserver] <defunct>
ec2-user  3544  0.0  0.0      0     0 ?        Zs   11:03   0:00 [pageserver] <defunct>
ec2-user  3545  0.1  0.0      0     0 ?        Z    11:03   0:00 [pageserver] <defunct>
ec2-user  3673  0.0  0.1 140664 29200 ?        Sl   11:03   0:00 /tmp/neon/bin/neon_local pageserver stop -m immediate
ec2-user  3681  0.0  0.0      0     0 ?        Zs   11:03   0:00 [pageserver] <defunct>
ec2-user  3682  0.2  0.0      0     0 ?        Z    11:03   0:00 [pageserver] <defunct>
ec2-user  3821  0.0  0.1 140664 28980 ?        Sl   11:03   0:00 /tmp/neon/bin/neon_local pageserver stop -m immediate
ec2-user  3833  0.0  0.1 140664 29160 ?        Sl   11:04   0:00 /tmp/neon/bin/neon_local pageserver stop -m immediate
ec2-user  3847  0.0  0.1 140664 29088 ?        Sl   11:04   0:00 /tmp/neon/bin/neon_local pageserver stop -m immediate
ssm-user  3853  0.0  0.0 121280   992 pts/0    S+   11:05   0:00 grep pageserver
ec2-user 30235  0.0  0.0      0     0 ?        Zs   10:52   0:00 [pageserver] <defunct>
ec2-user 30236  0.0  0.0      0     0 ?        Z    10:52   0:00 [pageserver] <defunct>
ec2-user 30284  0.0  0.0      0     0 ?        Zs   10:52   0:00 [pageserver] <defunct>
ec2-user 30285  1.0  0.0      0     0 ?        Z    10:52   0:08 [pageserver] <defunct>
ec2-user 30344  0.0  0.0      0     0 ?        Zs   10:52   0:00 [pageserver] <defunct>
ec2-user 30345  0.5  0.0      0     0 ?        Z    10:52   0:03 [pageserver] <defunct>
ec2-user 30439  0.0  0.0      0     0 ?        Zs   10:52   0:00 [pageserver] <defunct>
ec2-user 30440  0.5  0.0      0     0 ?        Z    10:52   0:03 [pageserver] <defunct>
ec2-user 30810  0.0  0.0      0     0 ?        Zs   10:54   0:00 [pageserver] <defunct>
ec2-user 30811  0.0  0.0      0     0 ?        Z    10:54   0:00 [pageserver] <defunct>
ec2-user 30918  0.0  0.0      0     0 ?        Zs   10:54   0:00 [pageserver] <defunct>
ec2-user 30919  8.0  0.0      0     0 ?        Z    10:54   0:50 [pageserver] <defunct>
ec2-user 30985  0.0  0.0      0     0 ?        Zs   10:54   0:00 [pageserver] <defunct>
ec2-user 30986  4.8  0.0      0     0 ?        Z    10:54   0:30 [pageserver] <defunct>
ec2-user 31205  0.0  0.0      0     0 ?        Zs   10:54   0:00 [pageserver] <defunct>
ec2-user 31206  2.4  0.0      0     0 ?        Z    10:54   0:15 [pageserver] <defunct>
ec2-user 32087  0.0  0.0      0     0 ?        Zs   10:56   0:00 [pageserver] <defunct>
ec2-user 32088 31.7  0.0      0     0 ?        Z    10:56   2:50 [pageserver] <defunct>
ec2-user 32591  0.0  0.0      0     0 ?        Zs   10:56   0:00 [pageserver] <defunct>
ec2-user 32592  0.7  0.0      0     0 ?        Z    10:56   0:03 [pageserver] <defunct>

So it says that the pageserver became a zombie process. As the tests run, the list grows. I'll continue looking into it.

@LizardWizzard (Contributor) commented Aug 16, 2022

I see the same thing for safekeeper and postgres:

ec2-user 30416  0.0  0.0      0     0 ?        Zs   10:52   0:00 [postgres] <defunct>
ec2-user 30439  0.0  0.0      0     0 ?        Zs   10:52   0:00 [pageserver] <defunct>
ec2-user 30440  0.1  0.0      0     0 ?        Z    10:52   0:03 [pageserver] <defunct>
ec2-user 30471  0.0  0.0      0     0 ?        Zs   10:52   0:00 [safekeeper] <defunct>
ec2-user 30472  0.0  0.0      0     0 ?        Z    10:52   0:00 [safekeeper] <defunct>
ec2-user 30490  0.0  0.0      0     0 ?        Zs   10:52   0:00 [safekeeper] <defunct>
ec2-user 30491  0.0  0.0      0     0 ?        Z    10:52   0:00 [safekeeper] <defunct>
ec2-user 30509  0.0  0.0      0     0 ?        Zs   10:52   0:00 [safekeeper] <defunct>
ec2-user 30510  0.0  0.0      0     0 ?        Z    10:52   0:00 [safekeeper] <defunct>

The process hierarchy for one of the pageserver processes looks like this:

sh-4.2$ pstree -aps 30235
systemd,1 --switched-root --system --deserialize 21
  └─containerd-shim,29203 -namespace moby -id 36c862f3a47133d56665a35f61f110842f466a07d5c4d8a760d408e4ff937c23 -address /run/containerd/containerd.sock
      └─tail,29225 -f /dev/null
          └─(pageserver,30235)

ec2-user 30235  0.0  0.0      0     0 ?        Zs   10:52   0:00 [pageserver] <defunct>

I have no idea where this `tail -f /dev/null` process comes from. Maybe it's some hack to keep the container running.

It is coming from our rustlegacy image:

CONTAINER ID   IMAGE                                                                                                               COMMAND              
36c862f3a471   369495373322.dkr.ecr.eu-central-1.amazonaws.com/rustlegacy:2746987948   "tail -f /dev/null"

@zoete (Contributor) commented Aug 16, 2022

I added gen2 runners today, which no longer use the rustlegacy image. I will also shuffle the runners. Maybe that will fix this issue.

@LizardWizzard mentioned this pull request on Aug 16, 2022
@LizardWizzard (Contributor)

Hmm, I looked at the runner when it ran just the build without the actual tests, and saw zombies for mold and cachepot.

Output:
sh-4.2$ ps aux | grep defunct
ec2-user  7406  0.0  0.0      0     0 ?        Z    13:23   0:00 [cachepot] <defunct>
ec2-user  7407  0.0  0.0      0     0 ?        Zs   13:23   0:00 [cachepot] <defunct>
ec2-user  7753  0.2  0.0      0     0 ?        Z    13:23   0:00 [mold] <defunct>
ec2-user  7776  0.1  0.0      0     0 ?        Z    13:23   0:00 [mold] <defunct>
ec2-user  7794  0.4  0.0      0     0 ?        Z    13:23   0:00 [mold] <defunct>
ec2-user  7829  0.1  0.0      0     0 ?        Z    13:23   0:00 [mold] <defunct>
ec2-user  7830  0.1  0.0      0     0 ?        Z    13:23   0:00 [mold] <defunct>
ec2-user  7848  0.1  0.0      0     0 ?        Z    13:23   0:00 [mold] <defunct>
ec2-user  7995  0.2  0.0      0     0 ?        Z    13:23   0:00 [mold] <defunct>
ec2-user  8027  0.1  0.0      0     0 ?        Z    13:23   0:00 [mold] <defunct>
ec2-user  8163  0.1  0.0      0     0 ?        Z    13:23   0:00 [mold] <defunct>
ec2-user  8277  0.1  0.0      0     0 ?        Z    13:23   0:00 [mold] <defunct>
ec2-user  8318  0.1  0.0      0     0 ?        Z    13:23   0:00 [mold] <defunct>
ec2-user  8361  0.1  0.0      0     0 ?        Z    13:23   0:00 [mold] <defunct>
ec2-user  8632  0.1  0.0      0     0 ?        Z    13:23   0:00 [mold] <defunct>
ec2-user  8666  0.1  0.0      0     0 ?        Z    13:23   0:00 [mold] <defunct>
ec2-user  8788  0.2  0.0      0     0 ?        Z    13:23   0:00 [mold] <defunct>
ec2-user  9141  0.1  0.0      0     0 ?        Z    13:23   0:00 [mold] <defunct>
ec2-user  9232  0.1  0.0      0     0 ?        Z    13:23   0:00 [mold] <defunct>
ec2-user  9453  0.1  0.0      0     0 ?        Z    13:23   0:00 [mold] <defunct>
ec2-user  9673  0.4  0.0      0     0 ?        Z    13:23   0:00 [mold] <defunct>
ec2-user 10155  0.1  0.0      0     0 ?        Z    13:23   0:00 [mold] <defunct>
ec2-user 10186  0.1  0.0      0     0 ?        Z    13:23   0:00 [mold] <defunct>
ec2-user 10227  0.1  0.0      0     0 ?        Z    13:23   0:00 [mold] <defunct>
ec2-user 10237  0.2  0.0      0     0 ?        Z    13:23   0:00 [mold] <defunct>
ec2-user 10315  0.2  0.0      0     0 ?        Z    13:23   0:00 [mold] <defunct>
ec2-user 10682  0.2  0.0      0     0 ?        Z    13:23   0:00 [mold] <defunct>
ec2-user 10938  0.4  0.0      0     0 ?        Z    13:23   0:00 [mold] <defunct>
ec2-user 11143  0.2  0.0      0     0 ?        Z    13:23   0:00 [mold] <defunct>
ec2-user 11241  0.3  0.0      0     0 ?        Z    13:23   0:00 [mold] <defunct>
ec2-user 11407  0.6  0.0      0     0 ?        Z    13:23   0:00 [mold] <defunct>
ec2-user 11669  0.3  0.0      0     0 ?        Z    13:23   0:00 [mold] <defunct>
ec2-user 11820  0.2  0.0      0     0 ?        Z    13:23   0:00 [mold] <defunct>
ec2-user 11976  0.2  0.0      0     0 ?        Z    13:24   0:00 [mold] <defunct>
ec2-user 12056  0.5  0.0      0     0 ?        Z    13:24   0:00 [mold] <defunct>
ec2-user 12062  0.2  0.0      0     0 ?        Z    13:24   0:00 [mold] <defunct>
ec2-user 12144  0.2  0.0      0     0 ?        Z    13:24   0:00 [mold] <defunct>
ec2-user 12163  0.3  0.0      0     0 ?        Z    13:24   0:00 [mold] <defunct>
ec2-user 12165  0.2  0.0      0     0 ?        Z    13:24   0:00 [mold] <defunct>
ec2-user 12411  0.4  0.0      0     0 ?        Z    13:24   0:00 [mold] <defunct>
ec2-user 12424  0.6  0.0      0     0 ?        Z    13:24   0:00 [mold] <defunct>
ec2-user 12545  0.4  0.0      0     0 ?        Z    13:24   0:00 [mold] <defunct>
ec2-user 12748  0.4  0.0      0     0 ?        Z    13:24   0:00 [mold] <defunct>
ec2-user 12834  0.4  0.0      0     0 ?        Z    13:24   0:00 [mold] <defunct>
ec2-user 12852  1.8  0.0      0     0 ?        Z    13:24   0:00 [mold] <defunct>
ec2-user 12936  1.1  0.0      0     0 ?        Z    13:24   0:00 [mold] <defunct>
ec2-user 12937  0.4  0.0      0     0 ?        Z    13:24   0:00 [mold] <defunct>
ec2-user 12993  1.0  0.0      0     0 ?        Z    13:24   0:00 [mold] <defunct>
ec2-user 13076  0.5  0.0      0     0 ?        Z    13:24   0:00 [mold] <defunct>
ec2-user 13087  0.4  0.0      0     0 ?        Z    13:24   0:00 [mold] <defunct>
ec2-user 13141  0.4  0.0      0     0 ?        Z    13:24   0:00 [mold] <defunct>
ec2-user 13203  0.3  0.0      0     0 ?        Z    13:24   0:00 [mold] <defunct>
ec2-user 13257  0.5  0.0      0     0 ?        Z    13:24   0:00 [mold] <defunct>
ec2-user 13380  3.8  0.0      0     0 ?        Z    13:24   0:00 [mold] <defunct>
ec2-user 13486  1.7  0.0      0     0 ?        Z    13:24   0:00 [mold] <defunct>
ec2-user 13576  1.8  0.0      0     0 ?        Z    13:24   0:00 [mold] <defunct>
ec2-user 14174  1.2  0.0      0     0 ?        Z    13:24   0:00 [mold] <defunct>
ec2-user 14414  6.3  0.0      0     0 ?        Z    13:24   0:00 [mold] <defunct>
ssm-user 14987  0.0  0.0 121280   976 pts/0    R+   13:24   0:00 grep defunct

So I think it has something to do with the runner and how it launches tasks, and not with our test suite.

Do you have any ideas @zoete? I have little knowledge about runner internals/task scheduling. Based on what I've seen, I think it runs the image with `tail -f /dev/null` and then uses docker exec to run the pipeline steps. But it is still unclear how this might lead to zombies.
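
For context: inside the container's PID namespace, orphaned children get re-parented to PID 1 (here `tail -f /dev/null`), and `tail` never calls wait() on them, so they linger as zombies. A proper init process does roughly the following (hedged sketch using the libc crate, not the actual fix that landed in #2289):

// Reap every child that has already exited so it doesn't stay a zombie.
// An init process runs this whenever it receives SIGCHLD (or periodically).
fn reap_exited_children() {
    loop {
        let mut status: libc::c_int = 0;
        // -1 = any child; WNOHANG = don't block if nothing has exited yet.
        match unsafe { libc::waitpid(-1, &mut status, libc::WNOHANG) } {
            0 => break,       // children remain, but none have exited
            -1 => break,      // no children left (ECHILD) or error
            _pid => continue, // reaped one zombie; look for more
        }
    }
}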

@LizardWizzard (Contributor)

Kudos to @zoete :) It seems that #2289 fixed the issue. There was only one zombie proxy at the end of the test run, and two moto instances appeared during shutdown, compared to more than a thousand zombies without the fix. No postgres/safekeeper/pageserver processes among the zombies anymore.

@LizardWizzard (Contributor) left a comment:

LGTM

@bojanserafimov merged commit e9a3499 into main on Aug 17, 2022
@bojanserafimov deleted the fixture-restart branch on August 17, 2022 at 12:17
petuhovskiy added a commit that referenced this pull request on Aug 18, 2022:
* Check for entire range during sasl validation (#2281)

* Gen2 GH runner (#2128)

* Re-add rustup override

* Try s3 bucket

* Set git version

* Use v4 cache key to prevent problems

* Switch to v5 for key

* Add second rustup fix

* Rebase

* Add kaniko steps

* Fix typo and set compress level

* Disable global run default

* Specify shell for step

* Change approach with kaniko

* Try less verbose shell spec

* Add submodule pull

* Add promote step

* Adjust dependency chain

* Try default swap again

* Use env

* Don't override aws key

* Make kaniko build conditional

* Specify runs on

* Try without dependency link

* Try soft fail

* Use image with git

* Try passing to next step

* Fix duplicate

* Try other approach

* Try other approach

* Fix typo

* Try other syntax

* Set env

* Adjust setup

* Try step 1

* Add link

* Try global env

* Fix mistake

* Debug

* Try other syntax

* Try other approach

* Change order

* Move output one step down

* Put output up one level

* Try other syntax

* Skip build

* Try output

* Re-enable build

* Try other syntax

* Skip middle step

* Update check

* Try first step of dockerhub push

* Update needs dependency

* Try explicit dir

* Add missing package

* Try other approach

* Try other approach

* Specify region

* Use with

* Try other approach

* Add debug

* Try other approach

* Set region

* Follow AWS example

* Try github approach

* Skip Qemu

* Try stdin

* Missing steps

* Add missing close

* Add echo debug

* Try v2 endpoint

* Use v1 endpoint

* Try without quotes

* Revert

* Try crane

* Add debug

* Split steps

* Fix duplicate

* Add shell step

* Conform to options

* Add verbose flag

* Try single step

* Try workaround

* First request fails hunch

* Try bullseye image

* Try other approach

* Adjust verbose level

* Try previous step

* Add more debug

* Remove debug step

* Remove rogue indent

* Try with larger image

* Add build tag step

* Update workflow for testing

* Add tag step for test

* Remove unused

* Update dependency chain

* Add ownership fix

* Use matrix for promote

* Force update

* Force build

* Remove unused

* Add new image

* Add missing argument

* Update dockerfile copy

* Update Dockerfile

* Update clone

* Update dockerfile

* Go to correct folder

* Use correct format

* Update dockerfile

* Remove cd

* Debug find where we are

* Add debug on first step

* Changedir to postgres

* Set workdir

* Use v1 approach

* Use other dependency

* Try other approach

* Try other approach

* Update dockerfile

* Update approach

* Update dockerfile

* Update approach

* Update dockerfile

* Update dockerfile

* Add workspace hack

* Update Dockerfile

* Update Dockerfile

* Update Dockerfile

* Change last step

* Cleanup pull in prep for review

* Force build images

* Add condition for latest tagging

* Use pinned version

* Try without name value

* Remove more names

* Shorten names

* Add kaniko comments

* Pin kaniko

* Pin crane and ecr helper

* Up one level

* Switch to pinned tag for rust image

* Force update for test

Co-authored-by: Rory de Zoete <rdezoete@RorysMacStudio.fritz.box>
Co-authored-by: Rory de Zoete <rdezoete@b04468bf-cdf4-41eb-9c94-aff4ca55e4bf.fritz.box>
Co-authored-by: Rory de Zoete <rdezoete@Rorys-Mac-Studio.fritz.box>
Co-authored-by: Rory de Zoete <rdezoete@4795e9ee-4f32-401f-85f3-f316263b62b8.fritz.box>
Co-authored-by: Rory de Zoete <rdezoete@2f8bc4e5-4ec2-4ea2-adb1-65d863c4a558.fritz.box>
Co-authored-by: Rory de Zoete <rdezoete@27565b2b-72d5-4742-9898-a26c9033e6f9.fritz.box>
Co-authored-by: Rory de Zoete <rdezoete@ecc96c26-c6c4-4664-be6e-34f7c3f89a3c.fritz.box>
Co-authored-by: Rory de Zoete <rdezoete@7caff3a5-bf03-4202-bd0e-f1a93c86bdae.fritz.box>

* Add missing step output, revert one deploy step (#2285)

* Add missing step output, revert one deploy step

* Conform to syntax

* Update approach

* Add missing value

* Add missing needs

Co-authored-by: Rory de Zoete <rdezoete@RorysMacStudio.fritz.box>

* Error for fatal not git repo (#2286)

Co-authored-by: Rory de Zoete <rdezoete@RorysMacStudio.fritz.box>

* Use main, not branch for ref check (#2288)

* Use main, not branch for ref check

* Add more debug

* Count main, not head

* Try new approach

* Conform to syntax

* Update approach

* Get full history

* Skip checkout

* Cleanup debug

* Remove more debug

Co-authored-by: Rory de Zoete <rdezoete@RorysMacStudio.fritz.box>

* Fix docker zombie process issue (#2289)

* Fix docker zombie process issue

* Init everywhere

Co-authored-by: Rory de Zoete <rdezoete@RorysMacStudio.fritz.box>

* Fix 1.63 clippy lints (#2282)

* split out timeline metrics, track layer map loading and size calculation

* reset rust cache for clippy run to avoid an ICE

additionally remove trailing whitespaces

* Rename pg_control_ffi.h to bindgen_deps.h, for clarity.

The pg_control_ffi.h name implies that it only includes stuff related to
pg_control.h. That's mostly true currently, but really the point of the
file is to include everything that we need to generate Rust definitions
from.

* Make local mypy behave like CI mypy (#2291)

* Fix flaky pageserver restarts in tests (#2261)

* Remove extra type aliases (#2280)

* Update cachepot endpoint (#2290)

* Update cachepot endpoint

* Update dockerfile & remove env

* Update image building process

* Cannot use metadata endpoint for this

* Update workflow

* Conform to kaniko syntax

* Update syntax

* Update approach

* Update dockerfiles

* Force update

* Update dockerfiles

* Update dockerfile

* Cleanup dockerfiles

* Update s3 test location

* Revert s3 experiment

* Add more debug

* Specify aws region

* Remove debug, add prefix

* Remove one more debug

Co-authored-by: Rory de Zoete <rdezoete@RorysMacStudio.fritz.box>

* workflows/benchmarking: increase timeout (#2294)

* Rework `init` in pageserver CLI  (#2272)

* Do not create initial tenant and timeline (adjust Python tests for that)
* Rework config handling during init, add --update-config to manage local config updates

* Fix: Always build images (#2296)

* Always build images

* Remove unused

Co-authored-by: Rory de Zoete <rdezoete@RorysMacStudio.fritz.box>

* Move auto-generated 'bindings' to a separate inner module.

Re-export only things that are used by other modules.

In the future, I'm imagining that we run bindgen twice, for Postgres
v14 and v15. The two sets of bindings would go into separate
'bindings_v14' and 'bindings_v15' modules.

Rearrange postgres_ffi modules.

Move function, to avoid Postgres version dependency in timelines.rs
Move function to generate a logical-message WAL record to postgres_ffi.

* fix cargo test

* Fix walreceiver and safekeeper bugs (#2295)

- There was an issue with zero commit_lsn `reason: LaggingWal { current_commit_lsn: 0/0, new_commit_lsn: 1/6FD90D38, threshold: 10485760 } }`. The problem was in `send_wal.rs`, where we initialized `end_pos = Lsn(0)` and in some cases sent it to the pageserver.
- IDENTIFY_SYSTEM previously returned `flush_lsn` as a physical end of WAL. Now it returns `flush_lsn` (as it was) to walproposer and `commit_lsn` to everyone else including pageserver.
- There was an issue with backoff where connection was cancelled right after initialization: `connected!` -> `safekeeper_handle_db: Connection cancelled` -> `Backoff: waiting 3 seconds`. The problem was in sleeping before establishing the connection. This is fixed by reworking retry logic.
- There was an issue with getting `NoKeepAlives` reason in a loop. The issue is probably the same as the previous.
- There was an issue with filtering safekeepers based on retry attempts, which could filter some safekeepers indefinitely. This is fixed by using a retry cooldown duration instead of retry attempts.
- Some `send_wal.rs` connections failed with errors without context. This is fixed by adding the timeline to safekeeper errors.

The new retry logic works like this (a rough sketch follows the list):
- Every candidate has a `next_retry_at` timestamp and is not considered for connection until that moment
- When walreceiver connection is closed, we update `next_retry_at` using exponential backoff, increasing the cooldown on every disconnect.
- When `last_record_lsn` was advanced using the WAL from the safekeeper, we reset the retry cooldown and exponential backoff, allowing walreceiver to reconnect to the same safekeeper instantly.
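
A rough Rust sketch of that per-safekeeper retry state; type and method names are illustrative, not the actual walreceiver code:

use std::time::{Duration, Instant};

struct RetryInfo {
    next_retry_at: Option<Instant>,
    attempt: u32,
}

impl RetryInfo {
    // Connection to this safekeeper closed: push the next attempt out
    // with exponential backoff, capped at `max`.
    fn on_disconnect(&mut self, base: Duration, max: Duration) {
        let exp = self.attempt.min(16); // cap the exponent so the multiplication stays safe
        let backoff = (base * 2u32.saturating_pow(exp)).min(max);
        self.next_retry_at = Some(Instant::now() + backoff);
        self.attempt = self.attempt.saturating_add(1);
    }

    // last_record_lsn advanced using WAL from this safekeeper:
    // reset the cooldown so we may reconnect to it immediately.
    fn on_progress(&mut self) {
        self.next_retry_at = None;
        self.attempt = 0;
    }

    // A candidate is considered for connection only once its cooldown has expired.
    fn eligible_now(&self) -> bool {
        self.next_retry_at.map_or(true, |t| Instant::now() >= t)
    }
}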

* on safekeeper registration pass availability zone param (#2292)

Co-authored-by: Kirill Bulatov <kirill@neon.tech>
Co-authored-by: Rory de Zoete <33318916+zoete@users.noreply.github.com>
Co-authored-by: Rory de Zoete <rdezoete@RorysMacStudio.fritz.box>
Co-authored-by: Rory de Zoete <rdezoete@b04468bf-cdf4-41eb-9c94-aff4ca55e4bf.fritz.box>
Co-authored-by: Rory de Zoete <rdezoete@Rorys-Mac-Studio.fritz.box>
Co-authored-by: Rory de Zoete <rdezoete@4795e9ee-4f32-401f-85f3-f316263b62b8.fritz.box>
Co-authored-by: Rory de Zoete <rdezoete@2f8bc4e5-4ec2-4ea2-adb1-65d863c4a558.fritz.box>
Co-authored-by: Rory de Zoete <rdezoete@27565b2b-72d5-4742-9898-a26c9033e6f9.fritz.box>
Co-authored-by: Rory de Zoete <rdezoete@ecc96c26-c6c4-4664-be6e-34f7c3f89a3c.fritz.box>
Co-authored-by: Rory de Zoete <rdezoete@7caff3a5-bf03-4202-bd0e-f1a93c86bdae.fritz.box>
Co-authored-by: Dmitry Rodionov <dmitry@neon.tech>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
Co-authored-by: bojanserafimov <bojan.serafimov7@gmail.com>
Co-authored-by: Alexander Bayandin <alexander@neon.tech>
Co-authored-by: Anastasia Lubennikova <anastasia@neon.tech>
Co-authored-by: Anton Galitsyn <agalitsyn@users.noreply.github.com>