Fix crash and other issues in telemetry reporter #4358

Merged: 1 commit into timescale:main from telemetry-crash-bgw, May 20, 2022

@erimatnor erimatnor commented May 19, 2022

Make the following changes to the telemetry reporter background
worker:

  • Add a read lock to the current relation that the reporter collects
    stats for. This lock protects against concurrent deletion of the
    relation, which could lead to errors that would prevent the reporter
    from completing its report.

  • Set an active snapshot in the telemetry background process for use
    when scanning a relation for stats collection.

  • Reopen the scan iterator when collecting chunk compression stats for
    a relation instead of keeping it open and restarting the scan. The
    previous approach appears to have caused crashes through memory
    corruption of the scan state. The exact cause has not been
    identified, but the change has been verified on a live running
    instance (thanks to @abrownsword for the help with reproducing the
    crash and testing fixes).
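
To make the three changes above concrete, here is a minimal, self-contained sketch of the pattern in plain C. It is a toy model, not the code from this PR: every type and function in it (`Relation`, `ChunkStat`, `ScanIter`, `try_read_lock`, `push_snapshot`, `scan_open`, `collect_compressed_bytes`, and so on) is a hypothetical stand-in and none of them are PostgreSQL or TimescaleDB APIs. In PostgreSQL terms the read lock would presumably correspond to an `AccessShareLock` on the relation, but that mapping is an assumption and the sketch does not depend on it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy model only: all names below are hypothetical and are NOT
 * PostgreSQL or TimescaleDB APIs. */

typedef struct Relation
{
	int relid;
	int dropped;    /* nonzero if the relation was dropped concurrently */
	int read_locks; /* number of read locks currently held */
} Relation;

typedef struct ChunkStat
{
	int relid;             /* relation the chunk belongs to */
	long compressed_bytes; /* the stat being summed */
} ChunkStat;

typedef struct ScanIter
{
	const ChunkStat *rows;
	size_t nrows;
	size_t pos;
	int relid;
} ScanIter;

static int snapshot_depth = 0; /* > 0 means a snapshot is active */

static void push_snapshot(void) { snapshot_depth++; }

static void pop_snapshot(void)
{
	assert(snapshot_depth > 0);
	snapshot_depth--;
}

/* Take a read lock unless the relation is already gone. */
static int try_read_lock(Relation *rel)
{
	if (rel->dropped)
		return 0;
	rel->read_locks++;
	return 1;
}

static void read_unlock(Relation *rel)
{
	assert(rel->read_locks > 0);
	rel->read_locks--;
}

/* Open a fresh iterator over the stats rows of one relation. */
static ScanIter *scan_open(const ChunkStat *rows, size_t nrows, int relid)
{
	ScanIter *it;

	assert(snapshot_depth > 0); /* in this model, scans need an active snapshot */
	it = malloc(sizeof(*it));
	if (it == NULL)
		return NULL;
	it->rows = rows;
	it->nrows = nrows;
	it->pos = 0;
	it->relid = relid;
	return it;
}

/* Return the next row for the iterator's relation, or NULL when done. */
static const ChunkStat *scan_next(ScanIter *it)
{
	while (it->pos < it->nrows)
	{
		const ChunkStat *row = &it->rows[it->pos++];
		if (row->relid == it->relid)
			return row;
	}
	return NULL;
}

static void scan_close(ScanIter *it) { free(it); }

/* Sum compressed bytes over all relations that still exist. */
static long collect_compressed_bytes(Relation *rels, size_t nrels,
									 const ChunkStat *rows, size_t nrows)
{
	long total = 0;
	size_t i;

	push_snapshot(); /* change 2: set an active snapshot for the scans */
	for (i = 0; i < nrels; i++)
	{
		const ChunkStat *row;
		ScanIter *it;

		/* change 1: lock the relation; skip it if it was dropped */
		if (!try_read_lock(&rels[i]))
			continue;

		/* change 3: open a new iterator per relation, never reuse one */
		it = scan_open(rows, nrows, rels[i].relid);
		if (it != NULL)
		{
			while ((row = scan_next(it)) != NULL)
				total += row->compressed_bytes;
			scan_close(it);
		}
		read_unlock(&rels[i]);
	}
	pop_snapshot();
	return total;
}
```

The point of the design is that no scan state outlives a single relation: each iteration opens its own iterator under the lock and closes it before the lock is released, so a dropped relation is skipped instead of aborting the whole report.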

Fixes #4266

@erimatnor

This PR depends on #4349, which should be merged first.

codecov bot commented May 19, 2022

Codecov Report

Merging #4358 (d0ac8fb) into main (8c5c7bb) will increase coverage by 0.00%.
The diff coverage is 76.92%.

@@           Coverage Diff           @@
##             main    #4358   +/-   ##
=======================================
  Coverage   90.81%   90.82%           
=======================================
  Files         217      217           
  Lines       40249    40263   +14     
=======================================
+ Hits        36553    36569   +16     
+ Misses       3696     3694    -2     

| Impacted Files | Coverage Δ |
|---|---|
| tsl/src/continuous_aggs/create.c | 88.63% <0.00%> (ø) |
| src/telemetry/telemetry.c | 81.25% <16.66%> (-1.46%) ⬇️ |
| src/telemetry/stats.c | 97.76% <100.00%> (+0.06%) ⬆️ |
| src/loader/bgw_message_queue.c | 85.52% <0.00%> (-2.64%) ⬇️ |
| tsl/src/reorder.c | 85.56% <0.00%> (+0.26%) ⬆️ |
| src/bgw/job.c | 93.16% <0.00%> (+0.28%) ⬆️ |
| src/bgw/scheduler.c | 85.58% <0.00%> (+2.64%) ⬆️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 46c95c4...d0ac8fb.

@erimatnor erimatnor force-pushed the telemetry-crash-bgw branch 2 times, most recently from bb9c396 to 4a60757 Compare May 19, 2022 15:39
@erimatnor erimatnor marked this pull request as ready for review May 19, 2022 15:58
@erimatnor erimatnor requested a review from a team as a code owner May 19, 2022 15:58
@erimatnor erimatnor requested review from pmwkaa, mkindahl, svenklemm, fabriziomello, akuzm, jnidzwetzki, mfundul and nikkhils and removed request for a team May 19, 2022 15:58
@svenklemm svenklemm added this to the TimescaleDB 2.7 milestone May 19, 2022
@erimatnor erimatnor force-pushed the telemetry-crash-bgw branch 4 times, most recently from 60d86a8 to e2363fc Compare May 20, 2022 13:43
@erimatnor erimatnor merged commit 7b9d867 into timescale:main May 20, 2022
@erimatnor erimatnor deleted the telemetry-crash-bgw branch May 20, 2022 14:52
svenklemm added a commit to svenklemm/timescaledb that referenced this pull request May 23, 2022
This release adds major new features since the 2.6.1 release.
We deem it moderate priority for upgrading.

This release includes these noteworthy features:

* Optimize continuous aggregate query performance and storage
* The following query clauses and functions can now be used in a continuous
  aggregate: FILTER, DISTINCT, ORDER BY as well as [Ordered-Set Aggregate](https://www.postgresql.org/docs/current/functions-aggregate.html#FUNCTIONS-ORDEREDSET-TABLE)
  and [Hypothetical-Set Aggregate](https://www.postgresql.org/docs/current/functions-aggregate.html#FUNCTIONS-HYPOTHETICAL-TABLE)
* Optimize now() query planning time
* Improve COPY insert performance
* Improve performance of UPDATE/DELETE on PG14 by excluding chunks

This release also includes several bug fixes.

If you are upgrading from a previous version and were using compression
with a non-default collation on a segmentby column, you should recompress
those hypertables.

**Features**
* timescale#4045 Custom origin support in CAGGs
* timescale#4120 Add logging for retention policy
* timescale#4158 Allow ANALYZE command on a data node directly
* timescale#4169 Add support for chunk exclusion on DELETE to PG14
* timescale#4209 Add support for chunk exclusion on UPDATE to PG14
* timescale#4269 Continuous Aggregates finals form
* timescale#4301 Add support for bulk inserts in COPY operator
* timescale#4311 Support non-superuser move chunk operations
* timescale#4330 Add GUC "bgw_launcher_poll_time"
* timescale#4340 Enable now() usage in plan-time chunk exclusion

**Bugfixes**
* timescale#3899 Fix segfault in Continuous Aggregates
* timescale#4225 Fix TRUNCATE error as non-owner on hypertable
* timescale#4236 Fix potential wrong order of results for compressed hypertable with a non-default collation
* timescale#4249 Fix option "timescaledb.create_group_indexes"
* timescale#4251 Fix INSERT into compressed chunks with dropped columns
* timescale#4255 Fix option "timescaledb.create_group_indexes"
* timescale#4259 Fix logic bug in extension update script
* timescale#4269 Fix bad Continuous Aggregate view definition reported in timescale#4233
* timescale#4289 Support moving compressed chunks between data nodes
* timescale#4300 Fix refresh window cap for cagg refresh policy
* timescale#4315 Fix memory leak in scheduler
* timescale#4323 Remove printouts from signal handlers
* timescale#4342 Fix move chunk cleanup logic
* timescale#4349 Fix crashes in functions using AlterTableInternal
* timescale#4358 Fix crash and other issues in telemetry reporter

**Thanks**
* @abrownsword for reporting a bug in the telemetry reporter and testing the fix
* @jsoref for fixing various misspellings in code, comments and documentation
* @yalon for reporting an error with ALTER TABLE RENAME on distributed hypertables
* @zhuizhuhaomeng for reporting and fixing a memory leak in our scheduler
@svenklemm svenklemm mentioned this pull request May 23, 2022
svenklemm added a commit that referenced this pull request May 23, 2022
mfundul pushed a commit to mfundul/timescaledb that referenced this pull request May 24, 2022
@mfundul mfundul mentioned this pull request May 24, 2022
mfundul pushed a commit that referenced this pull request May 24, 2022
Successfully merging this pull request may close these issues.

[Bug]: Periodic background worker crash and DB restart