[Bug]: Periodic background worker crash and DB restart #4266
Comments
Hi @abrownsword, thanks for reaching out. Unfortunately, without a stack trace it is very hard to figure out what happened. As a first step, it would be good to understand whether the issue happens only when the telemetry background worker executes, or whether other background jobs can cause the same issue.
Understood; I've struggled many times with bug reports that lack good repro cases or stack traces. If you can suggest any way I can capture more info, I'll give it a try. From my log files, for the past 9 days or so it has only happened about twice a day, and only due to the telemetry worker failure.
Just to rule out the possibility that it is a network issue, could you confirm that your network connection was stable over the past 9 days? The telemetry job depends on the network, so we'd just like to make sure the error isn't there.
It should have been -- is there a particular log I can look at that would record instability events?
Hi @abrownsword, sorry for the delay in getting back to you. As far as I know, network failures are not logged by Postgres. We are trying to look more deeply into your issue, so I'd like to ask for some more information you can hopefully provide:
Yes, the telemetry reporter failure seems to be the semi-regular event causing the DB to restart. I'll figure out how to get core dumps turned on.
Okay, I managed to get a core dump... here is the stack trace. Is there anything you'd like from it?
@abrownsword Thank you for your efforts to debug this and for generating the stack trace. I am seeing some odd things in it (in particular around frame 5) but might need your help tracking them down. First, if you can share the core dump file somehow, that would help. Second, could you give us some information on the tables and other objects in your database: how many hypertables, distributed hypertables, caggs, etc.? Do they have compression enabled with compressed chunks? Do you have a compressed cagg? This might help us reconstruct an environment that can reproduce the bug.
I'm happy to help. This db is running on an AArch64 platform, so I'm not sure that simply providing the core dump will actually be useful to you..? Let's discuss options for all this via email instead of on the issue. Can I contact you directly somehow? |
Make the following changes to the telemetry reporter background worker:

- Add a read lock to the current relation that the reporter collects stats for. This lock protects against concurrent deletion of the relation, which could lead to errors that would prevent the reporter from completing its report.
- Set an active snapshot in the telemetry background process for use when scanning a relation for stats collection.
- Reopen the scan iterator when collecting chunk compression stats for a relation instead of keeping it open and restarting the scan. The previous approach seems to cause crashes due to memory corruption of the scan state. Unfortunately, the exact cause has not been identified, but the change has been verified to work on a live running instance (thanks to @abrownsword for the help with reproducing the crash and testing fixes).

Fixes #4266
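For readers unfamiliar with the pattern the commit describes, here is a minimal sketch using only the stock PostgreSQL C API. It is not the actual TimescaleDB code; the function name and the `RelStats` type are hypothetical. It illustrates the three points above: take an `AccessShareLock` on the relation so a concurrent `DROP` cannot remove it mid-scan, push an active snapshot if the background worker has none, and open and close the scan per relation rather than reusing a long-lived iterator.

```c
#include "postgres.h"

#include "access/heapam.h"
#include "access/table.h"
#include "access/tableam.h"
#include "storage/lockdefs.h"
#include "utils/rel.h"
#include "utils/snapmgr.h"

/* Hypothetical container for whatever stats the worker collects. */
typedef struct RelStats
{
	int64		tuple_count;
} RelStats;

/*
 * Hypothetical helper: collect stats for one relation while holding a read
 * lock and with an active snapshot set. Assumes it is called inside a
 * transaction started in the worker loop (StartTransactionCommand), and
 * that the relation uses the heap table access method.
 */
static void
collect_relation_stats(Oid relid, RelStats *stats)
{
	Relation	rel;
	TableScanDesc scan;
	HeapTuple	tuple;
	bool		pushed_snapshot = false;

	/* Read lock: a concurrent DROP of the relation now blocks until we finish. */
	rel = table_open(relid, AccessShareLock);

	/* A background worker may not have an active snapshot; set one up. */
	if (!ActiveSnapshotSet())
	{
		PushActiveSnapshot(GetTransactionSnapshot());
		pushed_snapshot = true;
	}

	/* Open a fresh scan for this relation instead of restarting a shared one. */
	scan = table_beginscan(rel, GetActiveSnapshot(), 0, NULL);
	while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
		stats->tuple_count++;
	table_endscan(scan);

	if (pushed_snapshot)
		PopActiveSnapshot();

	table_close(rel, AccessShareLock);
}
```

Opening a fresh scan per relation keeps the scan state short-lived, which is the design choice the commit relies on to avoid the corrupted-scan-state crashes even though the root cause was never pinned down.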
What type of bug is this?
Crash
What subsystems and features are affected?
Background worker
What happened?
2-3 times per day my database logs this and then goes into recovery mode and restarts:
I have also seen this message:
The failure of the background worker causes a SIGUSR or SIGQUIT signal that restarts the backend process. I haven't captured a useful stack trace in gdb, as I would need to be debugging the failing background worker itself and don't know how to set that up.
TimescaleDB version affected
2.6.1
PostgreSQL version used
14.2
What operating system did you use?
Ubuntu 20.04, Linux 4.9.277-122 #1 SMP PREEMPT Mon Feb 28 14:30:14 UTC 2022 aarch64 GNU/Linux
What installation method did you use?
Source
What platform did you run on?
On prem/Self-hosted
Relevant log output and stack trace
How can we reproduce the bug?