
[Bug]: Periodic background worker crash and DB restart #4266

Closed
abrownsword opened this issue Apr 22, 2022 · 9 comments · Fixed by #4358
Labels
bgw (The background worker subsystem, including the scheduler) · bug · segfault (Segmentation fault) · ship-stopper · waiting-for-author

Comments

@abrownsword

abrownsword commented Apr 22, 2022

What type of bug is this?

Crash

What subsystems and features are affected?

Background worker

What happened?

2-3 times per day my database logs this and then goes into recovery mode and restarts:

2022-04-15 15:15:24.112 UTC [11443] WARNING:  failed to launch job 1005 "Refresh Continuous Aggregate Policy [1005]": out of background workers
2022-04-15 15:15:24.112 UTC [11443] WARNING:  failed to launch job 1007 "Refresh Continuous Aggregate Policy [1007]": out of background workers
2022-04-15 15:15:24.117 UTC [11443] WARNING:  failed to launch job 1007 "Refresh Continuous Aggregate Policy [1007]": out of background workers
2022-04-15 15:15:24.122 UTC [30121] LOG:  refreshing continuous aggregate "environ_summary_daily" in window [ 2022-03-16 00:00:00+00, 2022-04-14 00:00:00+00 ]
2022-04-15 15:15:24.131 UTC [30124] LOG:  refreshing continuous aggregate "vsensor_summary_daily" in window [ 2022-03-16 00:00:00+00, 2022-04-14 00:00:00+00 ]
2022-04-15 15:15:24.131 UTC [30122] LOG:  refreshing continuous aggregate "monitor_summary_daily" in window [ 2022-03-16 00:00:00+00, 2022-04-14 00:00:00+00 ]
2022-04-15 15:15:24.132 UTC [30123] LOG:  refreshing continuous aggregate "state_summary_daily" in window [ 2022-03-16 00:00:00+00, 2022-04-14 00:00:00+00 ]
2022-04-15 15:15:24.527 UTC [30125] LOG:  refreshing continuous aggregate "power_summary_daily" in window [ 2022-03-16 00:00:00+00, 2022-04-14 00:00:00+00 ]
2022-04-15 15:42:02.743 UTC [2442] LOG:  received fast shutdown request
2022-04-15 15:42:02.746 UTC [2442] LOG:  aborting any active transactions
terminating TimescaleDB job scheduler due to administrator command
2022-04-15 15:42:02.747 UTC [11443] FATAL:  terminating background worker "TimescaleDB Background Worker Scheduler" due to administrator command
2022-04-15 15:42:02.747 UTC [30248] grafana@grafana FATAL:  terminating connection due to administrator command
2022-04-15 15:42:02.756 UTC [11448] mcp@mcp FATAL:  terminating connection due to administrator command
2022-04-15 15:42:02.761 UTC [11445] LOG:  terminating TimescaleDB job scheduler due to administrator command
2022-04-15 15:42:02.761 UTC [11445] FATAL:  terminating background worker "TimescaleDB Background Worker Scheduler" due to administrator command
terminating TimescaleDB background worker launcher due to administrator command
2022-04-15 15:42:02.777 UTC [11441] FATAL:  terminating background worker "TimescaleDB Background Worker Launcher" due to administrator command
2022-04-15 15:42:02.777 UTC [11445] LOG:  terminating TimescaleDB job scheduler due to administrator command
2022-04-15 15:42:02.775 UTC [11447] mcp@mcp FATAL:  terminating connection due to administrator command
terminating TimescaleDB job scheduler due to administrator command
2022-04-15 15:42:02.795 UTC [2442] LOG:  background worker "logical replication launcher" (PID 11442) exited with exit code 1
2022-04-15 15:42:02.809 UTC [2442] LOG:  background worker "TimescaleDB Background Worker Launcher" (PID 11441) exited with exit code 1
2022-04-15 15:42:02.816 UTC [2442] LOG:  background worker "TimescaleDB Background Worker Scheduler" (PID 11445) exited with exit code 1
2022-04-15 15:42:02.822 UTC [2442] LOG:  background worker "TimescaleDB Background Worker Scheduler" (PID 11443) exited with exit code 1

I have also seen this message:

2022-04-16 09:43:19.396 UTC [31146] LOG:  background worker "Telemetry Reporter [1]" (PID 56968) was terminated by signal 11: Segmentation fault

The failure of the background worker triggers a SIGUSR or SIGQUIT signal that restarts the backend process. I haven't captured a useful stack trace in gdb, as I would need to be attached to the failing background worker itself and don't know how to set that up.

TimescaleDB version affected

2.6.1

PostgreSQL version used

14.2

What operating system did you use?

Ubuntu 20.04, Linux 4.9.277-122 #1 SMP PREEMPT Mon Feb 28 14:30:14 UTC 2022 aarch64 GNU/Linux

What installation method did you use?

Source

What platform did you run on?

On prem/Self-hosted

Relevant log output and stack trace

See above.  I have been unable to capture a stack trace.

How can we reproduce the bug?

Unfortunately, I don't have a simple reproduction case. My setup is actually a custom program reading from a different database (a much older postgres/timescale installation) and inserting the data into the new one, which is experiencing the crash. I can provide the various pieces, but they are large/unwieldy and would be a bit complex to set up (and may require running on the single-board ARM computer I'm using, if this is aarch64-specific).
@mkindahl added the segfault and bgw labels on Apr 25, 2022
@horzsolt

Hi @abrownsword, thanks for reaching out. Unfortunately, without a stack trace it is very hard to figure out what happened. As a first step, it would be good to understand whether the issue happens only when the telemetry bg worker executes, or whether other bg jobs trigger the same issue as well.

@abrownsword
Author

Understood; I've struggled many times with bug reports that lack good repro cases or stack traces. If you can suggest any way to capture more info, I'll give it a try.

It appears from my log files that for the past 9 days or so it has only happened about twice a day, and only due to the Telemetry Worker failure.

@konskov
Contributor

konskov commented Apr 26, 2022

Just to rule out the possibility that it is a network issue, could you confirm that your network connection was stable over the past 9 days? The telemetry job depends on the network, so we'd just like to make sure the error isn't coming from there.

@abrownsword
Author

It should have been -- is there a particular log I can look at that will note instability events?

@konskov self-assigned this on Apr 26, 2022
@konskov
Contributor

konskov commented Apr 28, 2022

Hi @abrownsword , sorry for the delay getting back to you. As far as I know, network failures are not logged by postgres. We are trying to look more deeply into your issue so I’d like to ask for some more information you can hopefully provide:

  • Does the segfault you are experiencing also occur 2-3 times a day?
  • Would you be able to get a core dump?

@abrownsword
Author

Yes, the telemetry reporter failure seems to be the semi-regular event causing the DB to restart.

I'll figure out how to get core dumps turned on.

@abrownsword
Author

abrownsword commented May 2, 2022

Okay, I managed to get a core dump... here is the stack trace. Is there anything you'd like from it?

Core was generated by 'postgres: 14/main: Telemetry Reporter [1]'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000005567b2d944 in index_rescan (scan=0x5573487ac8, keys=0x7fed4ecbb0, nkeys=1, orderbys=0x0, norderbys=0)
    at ./build/../src/backend/access/index/indexam.c:302
302	./build/../src/backend/access/index/indexam.c: No such file or directory.
(gdb) bt
#0  0x0000005567b2d944 in index_rescan (scan=0x5573487ac8, keys=0x7fed4ecbb0, nkeys=1, orderbys=0x0, norderbys=0)
    at ./build/../src/backend/access/index/indexam.c:302
#1  0x0000007f2fb1ff14 in ts_scanner_rescan (ctx=0x589, ctx@entry=0x7fed4ecac8, scankey=0x55681d7500 <TTSOpsBufferHeapTuple>)
    at /home/andrew/timescaledb/src/scanner.c:183
#2  0x0000007f2fb20b2c in ts_scan_iterator_rescan (iterator=0x589, iterator@entry=0x7fed4ecac8)
    at /home/andrew/timescaledb/src/scan_iterator.c:59
#3  0x0000007f2fb2befc in ts_scan_iterator_start_or_restart_scan (iterator=0x7fed4ecac8) at /home/andrew/timescaledb/src/scan_iterator.h:121
#4  get_chunk_compression_stats (statsctx=0x7fed4ecac0, chunk=0x0, compr_stats=0x7fed4eca70)
    at /home/andrew/timescaledb/src/telemetry/stats.c:350
#5  process_chunk (statsctx=0x7fed4ecac0, chunk_reltype=RELTYPE_CONTINUOUS_AGG, class=0x7f31d1d990, chunk=0x0)
    at /home/andrew/timescaledb/src/telemetry/stats.c:417
#6  ts_telemetry_stats_gather (stats=0x7fed4ecd98) at /home/andrew/timescaledb/src/telemetry/stats.c:562
#7  0x0000007f2fb2d068 in build_telemetry_report () at /home/andrew/timescaledb/src/telemetry/telemetry.c:426
#8  0x0000007f2fb2cb68 in ts_build_version_request (host=0x7f2fb49d3e "telemetry.timescale.com", path=0x7f2fb49d56 "/v1/metrics")
    at /home/andrew/timescaledb/src/telemetry/telemetry.c:560
#9  ts_telemetry_main (host=0x7f2fb49d3e "telemetry.timescale.com", path=0x7f2fb49d56 "/v1/metrics", service=0x7f2fb49cd7 "https")
    at /home/andrew/timescaledb/src/telemetry/telemetry.c:638
#10 0x0000007f2fb27004 in ts_bgw_job_run_and_set_next_start (job=0x55733c7d68, initial_runs=12, next_interval=0x7fed4ed550, func=<optimized out>)
    at /home/andrew/timescaledb/src/bgw/job.c:971
#11 ts_bgw_job_execute (job=0x55733c7d68) at /home/andrew/timescaledb/src/bgw/job.c:761
#12 0x0000007f2fb272e8 in ts_bgw_job_entrypoint (fcinfo=<optimized out>) at /home/andrew/timescaledb/src/bgw/job.c:877
#13 0x0000005567d69490 in StartBackgroundWorker () at ./build/../src/backend/postmaster/bgworker.c:875
#14 0x0000005567d76074 in do_start_bgworker (rw=0x55733c3b30) at ./build/../src/backend/postmaster/postmaster.c:5866
#15 maybe_start_bgworkers () at ./build/../src/backend/postmaster/postmaster.c:6091
#16 0x0000005567d769c0 in sigusr1_handler (postgres_signal_arg=<optimized out>) at ./build/../src/backend/postmaster/postmaster.c:5255
#17 <signal handler called>
#18 0x0000007f7853836c in __GI___select (nfds=nfds@entry=9, readfds=readfds@entry=0x7fed4eed90, writefds=writefds@entry=0x0, 
    exceptfds=exceptfds@entry=0x0, timeout=timeout@entry=0x7fed4eece8) at ../sysdeps/unix/sysv/linux/select.c:53
#19 0x0000005567d76fb4 in ServerLoop () at ./build/../src/backend/postmaster/postmaster.c:1709
#20 0x0000005567d788a4 in PostmasterMain (argc=<optimized out>, argv=<optimized out>) at ./build/../src/backend/postmaster/postmaster.c:1417
#21 0x0000005567ad7290 in main (argc=5, argv=0x557336fe40) at ./build/../src/backend/main/main.c:209

@konskov removed their assignment on May 3, 2022
@erimatnor self-assigned this on May 3, 2022
@jfjoly added this to the TimescaleDB 2.7 milestone on May 11, 2022
@erimatnor
Contributor

@abrownsword Thank you for your efforts to debug this and generate the stack trace. I am seeing some odd things in the trace, but I might need your help tracking them down. Specifically, in frame 5, chunk_reltype=RELTYPE_CONTINUOUS_AGG shouldn't be possible given the source code at the call sites of process_chunk; note the chunk_reltype values required to reach that call. So I suspect either memory corruption or something odd with the stack trace itself.

First, it would help if you could share the core dump file somehow. Second, it would help to have some information about the tables and other objects in your database: how many hypertables, distributed hypertables, caggs, etc.? Do they have compression enabled with compressed chunks? Do you have a compressed cagg? This might help us reconstruct an environment that reproduces the bug.

@abrownsword
Author

I'm happy to help. This db is running on an AArch64 platform, so I'm not sure whether simply providing the core dump would actually be useful to you.

Let's discuss options for all this via email instead of on the issue. Can I contact you directly somehow?

erimatnor added a commit that referenced this issue May 20, 2022
Make the following changes to the telemetry reporter background worker:

- Add a read lock to the current relation that the reporter collects
  stats for. This lock protects against concurrent deletion of the
  relation, which could lead to errors that would prevent the reporter
  from completing its report.
- Set an active snapshot in the telemetry background process for use
  when scanning a relation for stats collection.

- Reopen the scan iterator when collecting chunk compression stats for
  a relation instead of keeping it open and restarting the scan. The
  previous approach seems to cause crashes due to memory corruption of
  the scan state. Unfortunately, the exact cause has not been
  identified, but the change has been verified to work on a live
  running instance (thanks to @abrownsword for the help with
  reproducing the crash and testing fixes).

Fixes #4266