
[Bug]: Periodic background worker crash and DB restart #4266

Closed
abrownsword opened this issue Apr 22, 2022 · 9 comments · Fixed by #4358
Labels
bgw (The background worker subsystem, including the scheduler) · bug · segfault (Segmentation fault) · ship-stopper · waiting-for-author

Comments

@abrownsword

abrownsword commented Apr 22, 2022

What type of bug is this?

Crash

What subsystems and features are affected?

Background worker

What happened?

2-3 times per day my database logs this and then goes into recovery mode and restarts:

2022-04-15 15:15:24.112 UTC [11443] WARNING:  failed to launch job 1005 "Refresh Continuous Aggregate Policy [1005]": out of background workers
2022-04-15 15:15:24.112 UTC [11443] WARNING:  failed to launch job 1007 "Refresh Continuous Aggregate Policy [1007]": out of background workers
2022-04-15 15:15:24.117 UTC [11443] WARNING:  failed to launch job 1007 "Refresh Continuous Aggregate Policy [1007]": out of background workers
2022-04-15 15:15:24.122 UTC [30121] LOG:  refreshing continuous aggregate "environ_summary_daily" in window [ 2022-03-16 00:00:00+00, 2022-04-14 00:00:00+00 ]
2022-04-15 15:15:24.131 UTC [30124] LOG:  refreshing continuous aggregate "vsensor_summary_daily" in window [ 2022-03-16 00:00:00+00, 2022-04-14 00:00:00+00 ]
2022-04-15 15:15:24.131 UTC [30122] LOG:  refreshing continuous aggregate "monitor_summary_daily" in window [ 2022-03-16 00:00:00+00, 2022-04-14 00:00:00+00 ]
2022-04-15 15:15:24.132 UTC [30123] LOG:  refreshing continuous aggregate "state_summary_daily" in window [ 2022-03-16 00:00:00+00, 2022-04-14 00:00:00+00 ]
2022-04-15 15:15:24.527 UTC [30125] LOG:  refreshing continuous aggregate "power_summary_daily" in window [ 2022-03-16 00:00:00+00, 2022-04-14 00:00:00+00 ]
2022-04-15 15:42:02.743 UTC [2442] LOG:  received fast shutdown request
2022-04-15 15:42:02.746 UTC [2442] LOG:  aborting any active transactions
terminating TimescaleDB job scheduler due to administrator command
2022-04-15 15:42:02.747 UTC [11443] FATAL:  terminating background worker "TimescaleDB Background Worker Scheduler" due to administrator command
2022-04-15 15:42:02.747 UTC [30248] grafana@grafana FATAL:  terminating connection due to administrator command
2022-04-15 15:42:02.756 UTC [11448] mcp@mcp FATAL:  terminating connection due to administrator command
2022-04-15 15:42:02.761 UTC [11445] LOG:  terminating TimescaleDB job scheduler due to administrator command
2022-04-15 15:42:02.761 UTC [11445] FATAL:  terminating background worker "TimescaleDB Background Worker Scheduler" due to administrator command
terminating TimescaleDB background worker launcher due to administrator command
2022-04-15 15:42:02.777 UTC [11441] FATAL:  terminating background worker "TimescaleDB Background Worker Launcher" due to administrator command
2022-04-15 15:42:02.777 UTC [11445] LOG:  terminating TimescaleDB job scheduler due to administrator command
2022-04-15 15:42:02.775 UTC [11447] mcp@mcp FATAL:  terminating connection due to administrator command
terminating TimescaleDB job scheduler due to administrator command
2022-04-15 15:42:02.795 UTC [2442] LOG:  background worker "logical replication launcher" (PID 11442) exited with exit code 1
2022-04-15 15:42:02.809 UTC [2442] LOG:  background worker "TimescaleDB Background Worker Launcher" (PID 11441) exited with exit code 1
2022-04-15 15:42:02.816 UTC [2442] LOG:  background worker "TimescaleDB Background Worker Scheduler" (PID 11445) exited with exit code 1
2022-04-15 15:42:02.822 UTC [2442] LOG:  background worker "TimescaleDB Background Worker Scheduler" (PID 11443) exited with exit code 1

I have also seen this message:

2022-04-16 09:43:19.396 UTC [31146] LOG:  background worker "Telemetry Reporter [1]" (PID 56968) was terminated by signal 11: Segmentation fault

The failure of the background worker triggers a SIGUSR or SIGQUIT signal that restarts the backend process. I haven't captured a useful stack trace in gdb, as I would need to be attached to the failing background worker itself and don't know how to set that up.

TimescaleDB version affected

2.6.1

PostgreSQL version used

14.2

What operating system did you use?

Ubuntu 20.04, Linux 4.9.277-122 #1 SMP PREEMPT Mon Feb 28 14:30:14 UTC 2022 aarch64 GNU/Linux

What installation method did you use?

Source

What platform did you run on?

On prem/Self-hosted

Relevant log output and stack trace

See above.  I have been unable to capture a stack trace.

How can we reproduce the bug?

Unfortunately, I don't have a simple reproduction case. My setup is actually a custom program reading from a different database (a much older postgres/timescale installation) and inserting the data into the new one, which is experiencing the crash. I can provide the various pieces, but they are large/unwieldy and would be a bit complex to set up (and may require running on the single-board ARM computer I'm using, if this is aarch64-specific).
@mkindahl added the segfault and bgw labels on Apr 25, 2022
@horzsolt

Hi @abrownsword, thanks for reaching out. Unfortunately, without a stack trace it is very hard to figure out what happened. As a first step, it would be good to understand whether the issue happens only when the telemetry bg worker executes, or whether other bg jobs trigger the same issue as well.

@abrownsword
Author

Understood; I've struggled many times with bug reports that lack good repro cases or stack traces. If you can suggest any way to capture more info, I'll give it a try.

It appears from my log files that for the past 9 days or so it has only happened about twice a day, and only due to the Telemetry Worker failure.

@konskov
Contributor

konskov commented Apr 26, 2022

Just to rule out the possibility that it is a network issue, could you confirm that your network connection was stable over the past 9 days? The telemetry job depends on the network, so we'd just like to make sure the error isn't coming from there.

@abrownsword
Author

It should have been -- is there a particular log I can look at that will note instability events?

@konskov self-assigned this on Apr 26, 2022
@konskov
Contributor

konskov commented Apr 28, 2022

Hi @abrownsword , sorry for the delay getting back to you. As far as I know, network failures are not logged by postgres. We are trying to look more deeply into your issue so I’d like to ask for some more information you can hopefully provide:

  • Does the segfault you are experiencing also occur 2-3 times a day?
  • Would you be able to get a core dump?

@abrownsword
Author

Yes, the telemetry reporter failure seems to be the semi-regular event causing the DB to restart.

I'll figure out how to get core dumps turned on.

@abrownsword
Author

abrownsword commented May 2, 2022

Okay, I managed to get a core dump... here is the stack trace. Is there anything you'd like from it?

Core was generated by 'postgres: 14/main: Telemetry Reporter [1]'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000005567b2d944 in index_rescan (scan=0x5573487ac8, keys=0x7fed4ecbb0, nkeys=1, orderbys=0x0, norderbys=0)
    at ./build/../src/backend/access/index/indexam.c:302
302	./build/../src/backend/access/index/indexam.c: No such file or directory.
(gdb) bt
#0  0x0000005567b2d944 in index_rescan (scan=0x5573487ac8, keys=0x7fed4ecbb0, nkeys=1, orderbys=0x0, norderbys=0)
    at ./build/../src/backend/access/index/indexam.c:302
#1  0x0000007f2fb1ff14 in ts_scanner_rescan (ctx=0x589, ctx@entry=0x7fed4ecac8, scankey=0x55681d7500 <TTSOpsBufferHeapTuple>)
    at /home/andrew/timescaledb/src/scanner.c:183
#2  0x0000007f2fb20b2c in ts_scan_iterator_rescan (iterator=0x589, iterator@entry=0x7fed4ecac8)
    at /home/andrew/timescaledb/src/scan_iterator.c:59
#3  0x0000007f2fb2befc in ts_scan_iterator_start_or_restart_scan (iterator=0x7fed4ecac8) at /home/andrew/timescaledb/src/scan_iterator.h:121
#4  get_chunk_compression_stats (statsctx=0x7fed4ecac0, chunk=0x0, compr_stats=0x7fed4eca70)
    at /home/andrew/timescaledb/src/telemetry/stats.c:350
#5  process_chunk (statsctx=0x7fed4ecac0, chunk_reltype=RELTYPE_CONTINUOUS_AGG, class=0x7f31d1d990, chunk=0x0)
    at /home/andrew/timescaledb/src/telemetry/stats.c:417
#6  ts_telemetry_stats_gather (stats=0x7fed4ecd98) at /home/andrew/timescaledb/src/telemetry/stats.c:562
#7  0x0000007f2fb2d068 in build_telemetry_report () at /home/andrew/timescaledb/src/telemetry/telemetry.c:426
#8  0x0000007f2fb2cb68 in ts_build_version_request (host=0x7f2fb49d3e "telemetry.timescale.com", path=0x7f2fb49d56 "/v1/metrics")
    at /home/andrew/timescaledb/src/telemetry/telemetry.c:560
#9  ts_telemetry_main (host=0x7f2fb49d3e "telemetry.timescale.com", path=0x7f2fb49d56 "/v1/metrics", service=0x7f2fb49cd7 "https")
    at /home/andrew/timescaledb/src/telemetry/telemetry.c:638
#10 0x0000007f2fb27004 in ts_bgw_job_run_and_set_next_start (job=0x55733c7d68, initial_runs=12, next_interval=0x7fed4ed550, func=<optimized out>)
    at /home/andrew/timescaledb/src/bgw/job.c:971
#11 ts_bgw_job_execute (job=0x55733c7d68) at /home/andrew/timescaledb/src/bgw/job.c:761
#12 0x0000007f2fb272e8 in ts_bgw_job_entrypoint (fcinfo=<optimized out>) at /home/andrew/timescaledb/src/bgw/job.c:877
#13 0x0000005567d69490 in StartBackgroundWorker () at ./build/../src/backend/postmaster/bgworker.c:875
#14 0x0000005567d76074 in do_start_bgworker (rw=0x55733c3b30) at ./build/../src/backend/postmaster/postmaster.c:5866
#15 maybe_start_bgworkers () at ./build/../src/backend/postmaster/postmaster.c:6091
#16 0x0000005567d769c0 in sigusr1_handler (postgres_signal_arg=<optimized out>) at ./build/../src/backend/postmaster/postmaster.c:5255
#17 <signal handler called>
#18 0x0000007f7853836c in __GI___select (nfds=nfds@entry=9, readfds=readfds@entry=0x7fed4eed90, writefds=writefds@entry=0x0, 
    exceptfds=exceptfds@entry=0x0, timeout=timeout@entry=0x7fed4eece8) at ../sysdeps/unix/sysv/linux/select.c:53
#19 0x0000005567d76fb4 in ServerLoop () at ./build/../src/backend/postmaster/postmaster.c:1709
#20 0x0000005567d788a4 in PostmasterMain (argc=<optimized out>, argv=<optimized out>) at ./build/../src/backend/postmaster/postmaster.c:1417
#21 0x0000005567ad7290 in main (argc=5, argv=0x557336fe40) at ./build/../src/backend/main/main.c:209

@konskov removed their assignment on May 3, 2022
@erimatnor self-assigned this on May 3, 2022
@jfjoly added this to the TimescaleDB 2.7 milestone on May 11, 2022
@erimatnor
Contributor

@abrownsword Thank you for your efforts to debug this and generate the stack trace. I am seeing some odd things in the trace, but I might need your help tracking them down. Specifically, in frame 5, chunk_reltype=RELTYPE_CONTINUOUS_AGG shouldn't be possible given the source code at the call sites of process_chunk; note the chunk_reltype values required to reach that call. So I suspect either memory corruption or something odd with the stack trace itself.

First, it would help if you could share the core dump file somehow. Second, it would help to have some information about the tables and other objects in your database: how many hypertables, distributed hypertables, caggs, etc.? Do they have compression enabled with compressed chunks? Do you have a compressed cagg? This might help us reconstruct an environment that reproduces the bug.

@abrownsword
Author

I'm happy to help. This db is running on an AArch64 platform, so I'm not sure whether simply providing the core dump would actually be useful to you.

Let's discuss options for all this via email instead of on the issue. Can I contact you directly somehow?

erimatnor added a commit that referenced this issue May 20, 2022
Make the following changes to the telemetry reporter background worker:

- Add a read lock to the current relation that the reporter collects
  stats for. This lock protects against concurrent deletion of the
  relation, which could lead to errors that would prevent the reporter
  from completing its report.
- Set an active snapshot in the telemetry background process for use
  when scanning a relation for stats collection.

- Reopen the scan iterator when collecting chunk compression stats for
  a relation instead of keeping it open and restarting the scan. The
  previous approach seems to cause crashes due to memory corruption of
  the scan state. Unfortunately, the exact cause has not been
  identified, but the change has been verified to work on a live
  running instance (thanks to @abrownsword for the help with
  reproducing the crash and testing fixes).

Fixes #4266