Bug: `walreceiver` did not restart after erroring out #8172

kelvich · 2024-06-26T14:48:55Z

Got an interesting case with one of the production read-only endpoints. Walreceiver errored out and died:

2024-06-18 14:09:16.961	 {"app":"NeonVM","endpoint_id":"ep-winter-rice-59233042","pod":"compute-lingering-forest-a2yogi5o-6kzkf","_entry":"PG:2024-06-18 14:09:16.598 GMT ttid=a27b300c2ff46c602a1635ab92d236f3/03baf9167a86378faa6375d8273d0f6d sqlstate=53100 [493] FATAL:  could not write to file \"pg_wal/xlogtemp.493\": No space left on device"}
2024-06-18 08:22:29.288	{"app":"NeonVM","endpoint_id":"ep-winter-rice-59233042","pod":"compute-lingering-forest-a2yogi5o-6kzkf","_entry":"PG:2024-06-18 08:22:29.127 GMT ttid=a27b300c2ff46c602a1635ab92d236f3/03baf9167a86378faa6375d8273d0f6d sqlstate=00000 [493] LOG:  skipping missing configuration file \"/var/db/postgres/compute/pgdata/compute_ctl_temp_override.conf\""}
2024-06-18 08:22:24.446	{"app":"NeonVM","pod":"compute-lingering-forest-a2yogi5o-6kzkf","_entry":"PG:2024-06-18 08:22:24.347 GMT ttid=a27b300c2ff46c602a1635ab92d236f3/03baf9167a86378faa6375d8273d0f6d sqlstate=00000 [493] LOG:  started streaming WAL from primary at 3/49000000 on timeline 1"}

but then it did not start again.

https://neondb.slack.com/archives/C04DGM6SMTM/p1719394592373479
https://console.neon.tech/admin/regions/aws-eu-central-1/computes/compute-lingering-forest-a2yogi5o

Heikki suggested trying to reproduce this manually by adding elog(FATAL, "crashme") in walreceiver.


knizhnik · 2024-07-03T18:22:12Z

> Heikki suggested trying to reproduce this manually by adding elog(FATAL, "crashme") in walreceiver.

Did it, but the problem does not reproduce: walreceiver is restarted.
Also, please note that when a WAL write fails with a No space left on device error, Postgres panics:

			ereport(PANIC,
					(errcode_for_file_access(),
					 errmsg("could not write to WAL segment %s "
							"at offset %u, length %lu: %m",
							xlogfname, startoff, (unsigned long) segbytes)));

which should cause termination of the whole VM (not sure if k8s will restart it).
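To make the FATAL versus PANIC distinction concrete, here is a minimal, self-contained sketch of how a supervising process could tell the two apart from the child's wait status. It assumes the usual PostgreSQL behaviour that a FATAL error ends only the failing process with exit code 1, while PANIC calls abort(). The names `child_outcome`, `classify_exit` and `run_child` are hypothetical and are not taken from the postmaster source.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical classification of how a child process ended. */
typedef enum {
    CHILD_EXIT_CLEAN,  /* exit code 0 or 1, e.g. after a FATAL error */
    CHILD_EXIT_CRASH   /* any other exit code or a signal, e.g. abort() on PANIC */
} child_outcome;

/* Map a waitpid() status to one of the two outcomes above. */
static child_outcome classify_exit(int wstatus)
{
    if (WIFEXITED(wstatus) &&
        (WEXITSTATUS(wstatus) == 0 || WEXITSTATUS(wstatus) == 1))
        return CHILD_EXIT_CLEAN;
    return CHILD_EXIT_CRASH;
}

/*
 * Fork a child that ends either like FATAL (exit code 1) or like
 * PANIC (abort(), i.e. SIGABRT), and store its wait status.
 * Returns false if fork() or waitpid() fails.
 */
static bool run_child(bool panic, int *wstatus)
{
    pid_t pid = fork();

    if (pid < 0)
        return false;
    if (pid == 0) {
        if (panic)
            abort();
        _exit(1);
    }
    return waitpid(pid, wstatus, 0) == pid;
}
```

In this model only the `CHILD_EXIT_CLEAN` case leaves the rest of the system running, so only that case raises the question of whether the single process gets started again.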

knizhnik · 2024-07-04T09:04:58Z

Is there any proof that walreceiver actually died and was not restarted?
As far as I understand, the symptoms are the following: we have an active but lagging replica.
Is the walreceiver process actually absent, or is there some other proof that it failed to restart?
Maybe it is just blocked or waiting for something (from SK, for example)?

I looked through the postmaster code but did not find any obvious explanation for what could prevent a crashed walreceiver from being restarted.

kelvich · 2024-07-04T11:39:23Z

I manually checked that there was no walreceiver running on the replica; here is the ps output: https://neondb.slack.com/archives/C04DGM6SMTM/p1719401779142989?thread_ts=1719394592.373479&cid=C04DGM6SMTM

knizhnik · 2024-07-04T15:56:44Z

I failed to reproduce the problem by throwing a FATAL error in walreceiver (I tried different places and frequencies).
Maybe it is somehow related to running out of disk space, which makes it impossible to spawn a new process (for example, it tries to allocate some file, fails, and the process is not spawned)? Frankly speaking, I do not believe in this hypothesis, because I would expect some error to be reported and present in the Postgres log in that case.
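One way to test the out-of-disk-space hypothesis during a reproduction attempt is to record how much space is left on the volume that holds pg_wal when the walreceiver exits. The sketch below uses the POSIX `statvfs` call; the helper name `free_bytes` is hypothetical, and the path to check is left to the caller.

```c
#include <stdint.h>
#include <sys/statvfs.h>

/*
 * Return the number of bytes available to unprivileged writers on the
 * filesystem that contains `path`, or -1 if statvfs() fails.
 */
static int64_t free_bytes(const char *path)
{
    struct statvfs vfs;

    if (statvfs(path, &vfs) != 0)
        return -1;
    return (int64_t) vfs.f_bavail * (int64_t) vfs.f_frsize;
}
```

A value at or near zero when the walreceiver dies would at least confirm that the disk was still full when a restart would have been attempted.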

ololobus · 2024-07-16T15:41:36Z

We also haven't seen this in prod for a long time, but keeping it open for now.

kelvich added the t/bug (Issue Type: Bug) and c/compute (Component: compute, excluding postgres itself) labels Jun 26, 2024

kelvich assigned knizhnik Jun 26, 2024

kelvich mentioned this issue Jun 26, 2024

Epic: stabilize physical replication #6211

Open

kelvich changed the title ~~Bug: walreceiver did not restart after erroring our~~ Bug: walreceiver did not restart after erroring out Jul 1, 2024

ololobus unassigned knizhnik Jul 30, 2024

ololobus closed this as not planned Aug 27, 2024
