Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix relcache callback handling causing crashes #4193

Merged
merged 2 commits into from
Apr 1, 2022

Conversation

erimatnor
Copy link
Contributor

@erimatnor erimatnor commented Mar 24, 2022

Fix a crash that could corrupt indexes when running VACUUM FULL pg_class.

The crash happens when caches are queried/updated within a cache
invalidation function, which can lead to corruption and recursive
cache invalidations.

To solve the issue, make sure the relcache invalidation callback is
simple and never invokes the relcache or syscache directly or
indirectly.

Some background: The extension is preloaded and thus have planner
hooks installed irrespective of whether the extension is actually
installed or not in the current database. However, the hooks need to
be disabled as long as the extension is not installed. To avoid always
having to dynamically check for the presence of the extension, the
state is cached in the session.

However, the cached state needs to be updated if the extension changes
(altered/dropped/created). Therefore, the relcache invalidation
callback mechanism is (ab)used in TimescaleDB to signal updates to the
extension state across all active backends.

The signaling is implemented by installing a dummy table as part of
the extension and any invalidation on the relid for that table signals
a change in the extension state. However, as of this change, the
actual state is no longer determined in the callback itself, since it
requires use of the relcache and causes the bad behavior. Therefore,
the only thing that remains in the callback after this change is to
reset the extension state.

The actual state is instead resolved on-demand, but can still be
cached when the extension is in the installed state and the dummy
table is present with a known relid. However, if the extension is not
installed, the extension state can no longer be cached as there is no
way to signal other backends that the state should be reset when they
don't know the dummy table's relid, and cannot resolve it from within
the callback itself.

Fixes #3924

Disable-Check: Commit-Count

@erimatnor erimatnor added bug tech-debt Needs refactoring and improvement tasks related to the source code and its architecture. labels Mar 24, 2022
@erimatnor erimatnor self-assigned this Mar 24, 2022
@erimatnor erimatnor force-pushed the fix-vacuum-full-crash branch 2 times, most recently from 15e8b72 to 343726c Compare March 24, 2022 16:27
src/cache_invalidate.c Outdated Show resolved Hide resolved
@erimatnor erimatnor force-pushed the fix-vacuum-full-crash branch from 343726c to 7fb64f1 Compare March 25, 2022 07:09
@codecov
Copy link

codecov bot commented Mar 25, 2022

Codecov Report

Merging #4193 (1b3f478) into main (566a4ff) will decrease coverage by 0.01%.
The diff coverage is 96.46%.

❗ Current head 1b3f478 differs from pull request most recent head 6cddc41. Consider uploading reports for the commit 6cddc41 to get more accurate results

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #4193      +/-   ##
==========================================
- Coverage   90.77%   90.75%   -0.02%     
==========================================
  Files         215      215              
  Lines       39384    39464      +80     
==========================================
+ Hits        35750    35815      +65     
- Misses       3634     3649      +15     
Impacted Files Coverage Δ
src/chunk.h 100.00% <ø> (ø)
src/ts_catalog/catalog.h 100.00% <ø> (ø)
tsl/src/fdw/data_node_scan_plan.c 98.21% <ø> (ø)
src/dimension_vector.c 86.36% <50.00%> (-0.85%) ⬇️
src/extension.c 85.85% <88.23%> (-4.44%) ⬇️
src/hypertable_restrict_info.c 95.84% <95.83%> (+0.33%) ⬆️
src/dimension_slice.c 93.88% <97.50%> (+0.06%) ⬆️
src/cache_invalidate.c 91.48% <100.00%> (-0.35%) ⬇️
src/chunk.c 94.16% <100.00%> (-0.64%) ⬇️
src/chunk_constraint.c 94.19% <100.00%> (ø)
... and 17 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 915bd03...6cddc41. Read the comment docs.

@erimatnor erimatnor requested a review from mkindahl March 25, 2022 07:29
@erimatnor erimatnor force-pushed the fix-vacuum-full-crash branch from 7fb64f1 to ecf804f Compare March 25, 2022 07:31
@erimatnor
Copy link
Contributor Author

erimatnor commented Mar 25, 2022

I tested manually these two things:

  • That the PR code no longer reproduces the VACUUM FULL pg_class issue in [Bug]: Segmentation Fault when VACUUM FULL pg_class  #3924. Any help in trying to reproduce the bug with this code is appreciated. Hopefully no one can.
  • That extension state tracking still works across multiple sessions:
    • E.g., create extension in one session and make sure the state is "created" in another
    • Make sure that dropping the extension in one session moves the state to "unknown" in another session.

@erimatnor erimatnor marked this pull request as ready for review March 25, 2022 08:25
@erimatnor erimatnor requested a review from a team as a code owner March 25, 2022 08:25
@erimatnor erimatnor requested review from afiskon, pmwkaa, fabriziomello and nikkhils and removed request for a team March 25, 2022 08:25
Copy link
Contributor

@mkindahl mkindahl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments, but I need to continue checking.

src/extension_utils.c Outdated Show resolved Hide resolved
src/cache_invalidate.c Show resolved Hide resolved
src/cache_invalidate.c Show resolved Hide resolved
src/cache_invalidate.c Show resolved Hide resolved
src/cache_invalidate.c Show resolved Hide resolved
src/extension.c Show resolved Hide resolved
@erimatnor erimatnor force-pushed the fix-vacuum-full-crash branch from ecf804f to 1b3f478 Compare March 25, 2022 12:20
src/extension.c Show resolved Hide resolved
src/extension.c Show resolved Hide resolved
src/cache_invalidate.c Show resolved Hide resolved
src/cache_invalidate.c Show resolved Hide resolved
Copy link
Contributor

@mkindahl mkindahl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving this since the last comment is not critical.

@mfundul
Copy link
Contributor

mfundul commented Mar 28, 2022

Will look into this PR more seriously as well.

@mfundul
Copy link
Contributor

mfundul commented Mar 28, 2022

We should probably have a disclaimer in comments above every function that is potentially called in syscache callback context.

Something in the spirit of:

/*
 * WARNING: this function may be called in SysCache invalidation context. No SysCache accesses allowed.
 */

Copy link
Contributor

@mfundul mfundul left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other than the above LGTM.

@erimatnor erimatnor force-pushed the fix-vacuum-full-crash branch 3 times, most recently from 6cddc41 to e8c8857 Compare March 30, 2022 13:19
@erimatnor erimatnor force-pushed the fix-vacuum-full-crash branch 2 times, most recently from 8366543 to 9e32a92 Compare March 31, 2022 11:48
Add a TAP test that checks that the extensions state is updated across
concurrent sessions/backends when the extension is "dropped" or
"created".
@erimatnor erimatnor force-pushed the fix-vacuum-full-crash branch from 9e32a92 to 43bedf3 Compare March 31, 2022 11:53
@mfundul mfundul self-requested a review March 31, 2022 12:15
@@ -425,8 +425,7 @@ fdw_relinfo_create(PlannerInfo *root, RelOptInfo *rel, Oid server_oid, Oid local
*/
fpinfo->fdw_startup_cost = DEFAULT_FDW_STARTUP_COST;
fpinfo->fdw_tuple_cost = DEFAULT_FDW_TUPLE_COST;
Assert(ts_extension_oid != InvalidOid);
fpinfo->shippable_extensions = list_make1_oid(ts_extension_oid);
fpinfo->shippable_extensions = list_make1_oid(ts_extension_get_oid());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add this here:

Assert(OidIsValid(fpinfo->shippable_extensions));

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed the assertion since the function ts_extension_get_oid() cannot return an invalid OID.

endforeach(P_FILE)

set(MODULE_PATHNAME "$libdir/timescaledb-${PROJECT_VERSION_MOD}")
configure_file(functions.sql.in functions.sql)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All .sql files we have that use @MODULE_PATHNAME@ don't have an .in extension. We're not very consistent. Also, I think this is the first test script that actually uses @MODULE_PATHNAME@.

Copy link
Contributor Author

@erimatnor erimatnor Apr 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some test sql files use :MODULE_PATHNAME, because it is set at runtime via psql variables. The solution I use here is different and sets the @MODULE_PATHNAME@ to replace it at configuration (cmake) time. This is also why I added the .in extension. I think both are valid approaches, but the runtime-option isn't easily available for TAP tests.

Also, this sql file isn't strictly a test, but rather a library file. But not sure that matters.

If we are inconsistent elsewhere, we can fix that separately.

@erimatnor erimatnor merged commit 972afe0 into timescale:main Apr 1, 2022
@erimatnor erimatnor deleted the fix-vacuum-full-crash branch April 1, 2022 06:31
RafiaSabih added a commit to RafiaSabih/timescaledb that referenced this pull request Apr 5, 2022
This release is patch release. We recommend that you upgrade at the next available opportunity.

**Bugfixes**
* timescale#3974 Fix remote EXPLAIN with parameterized queries
* timescale#4122 Fix segfault on INSERT into distributed hypertable
* timescale#4142 Ignore invalid relid when deleting hypertable
* timescale#4159 Fix ADD COLUMN IF NOT EXISTS error on compressed hypertable
* timescale#4161 Fix memory handling during scans
* timescale#4186 Fix owner change for distributed hypertable
* timescale#4192 Abort sessions after extension reload
* timescale#4193 Fix relcache callback handling causing crashes

**Thanks**
* @abrownsword for reporting a crash in the telemetry reporter
* @daydayup863 for reporting issue with remote explain
This was referenced Apr 5, 2022
RafiaSabih added a commit to RafiaSabih/timescaledb that referenced this pull request Apr 6, 2022
This release is patch release. We recommend that you upgrade at the next available opportunity.

**Bugfixes**
* timescale#3974 Fix remote EXPLAIN with parameterized queries
* timescale#4122 Fix segfault on INSERT into distributed hypertable
* timescale#4142 Ignore invalid relid when deleting hypertable
* timescale#4159 Fix ADD COLUMN IF NOT EXISTS error on compressed hypertable
* timescale#4161 Fix memory handling during scans
* timescale#4186 Fix owner change for distributed hypertable
* timescale#4192 Abort sessions after extension reload
* timescale#4193 Fix relcache callback handling causing crashes

**Thanks**
* @abrownsword for reporting a crash in the telemetry reporter
* @daydayup863 for reporting issue with remote explain
RafiaSabih added a commit to RafiaSabih/timescaledb that referenced this pull request Apr 8, 2022
This release is patch release. We recommend that you upgrade at the next available opportunity.

**Bugfixes**
* timescale#3974 Fix remote EXPLAIN with parameterized queries
* timescale#4122 Fix segfault on INSERT into distributed hypertable
* timescale#4142 Ignore invalid relid when deleting hypertable
* timescale#4159 Fix ADD COLUMN IF NOT EXISTS error on compressed hypertable
* timescale#4161 Fix memory handling during scans
* timescale#4186 Fix owner change for distributed hypertable
* timescale#4192 Abort sessions after extension reload
* timescale#4193 Fix relcache callback handling causing crashes

**Thanks**
* @abrownsword for reporting a crash in the telemetry reporter
* @daydayup863 for reporting issue with remote explain
RafiaSabih added a commit to RafiaSabih/timescaledb that referenced this pull request Apr 8, 2022
This release is patch release. We recommend that you upgrade at the next available opportunity.

**Bugfixes**
* timescale#3974 Fix remote EXPLAIN with parameterized queries
* timescale#4122 Fix segfault on INSERT into distributed hypertable
* timescale#4142 Ignore invalid relid when deleting hypertable
* timescale#4159 Fix ADD COLUMN IF NOT EXISTS error on compressed hypertable
* timescale#4161 Fix memory handling during scans
* timescale#4186 Fix owner change for distributed hypertable
* timescale#4192 Abort sessions after extension reload
* timescale#4193 Fix relcache callback handling causing crashes

**Thanks**
* @abrownsword for reporting a crash in the telemetry reporter
* @daydayup863 for reporting issue with remote explain
RafiaSabih added a commit to RafiaSabih/timescaledb that referenced this pull request Apr 8, 2022
This release is patch release. We recommend that you upgrade at the next available opportunity.

**Bugfixes**
* timescale#3974 Fix remote EXPLAIN with parameterized queries
* timescale#4122 Fix segfault on INSERT into distributed hypertable
* timescale#4142 Ignore invalid relid when deleting hypertable
* timescale#4159 Fix ADD COLUMN IF NOT EXISTS error on compressed hypertable
* timescale#4161 Fix memory handling during scans
* timescale#4186 Fix owner change for distributed hypertable
* timescale#4192 Abort sessions after extension reload
* timescale#4193 Fix relcache callback handling causing crashes

**Thanks**
* @abrownsword for reporting a crash in the telemetry reporter
* @daydayup863 for reporting issue with remote explain
RafiaSabih added a commit to RafiaSabih/timescaledb that referenced this pull request Apr 8, 2022
This release is patch release. We recommend that you upgrade at the next available opportunity.

**Bugfixes**
* timescale#3974 Fix remote EXPLAIN with parameterized queries
* timescale#4122 Fix segfault on INSERT into distributed hypertable
* timescale#4142 Ignore invalid relid when deleting hypertable
* timescale#4159 Fix ADD COLUMN IF NOT EXISTS error on compressed hypertable
* timescale#4161 Fix memory handling during scans
* timescale#4186 Fix owner change for distributed hypertable
* timescale#4192 Abort sessions after extension reload
* timescale#4193 Fix relcache callback handling causing crashes

**Thanks**
* @abrownsword for reporting a crash in the telemetry reporter
* @daydayup863 for reporting issue with remote explain
RafiaSabih added a commit to RafiaSabih/timescaledb that referenced this pull request Apr 8, 2022
This release is patch release. We recommend that you upgrade at the next available opportunity.

**Bugfixes**
* timescale#3974 Fix remote EXPLAIN with parameterized queries
* timescale#4122 Fix segfault on INSERT into distributed hypertable
* timescale#4142 Ignore invalid relid when deleting hypertable
* timescale#4159 Fix ADD COLUMN IF NOT EXISTS error on compressed hypertable
* timescale#4161 Fix memory handling during scans
* timescale#4186 Fix owner change for distributed hypertable
* timescale#4192 Abort sessions after extension reload
* timescale#4193 Fix relcache callback handling causing crashes

**Thanks**
* @abrownsword for reporting a crash in the telemetry reporter
* @daydayup863 for reporting issue with remote explain
RafiaSabih added a commit to RafiaSabih/timescaledb that referenced this pull request Apr 8, 2022
This release is patch release. We recommend that you upgrade at the next available opportunity.

**Bugfixes**
* timescale#3974 Fix remote EXPLAIN with parameterized queries
* timescale#4122 Fix segfault on INSERT into distributed hypertable
* timescale#4142 Ignore invalid relid when deleting hypertable
* timescale#4159 Fix ADD COLUMN IF NOT EXISTS error on compressed hypertable
* timescale#4161 Fix memory handling during scans
* timescale#4186 Fix owner change for distributed hypertable
* timescale#4192 Abort sessions after extension reload
* timescale#4193 Fix relcache callback handling causing crashes

**Thanks**
* @abrownsword for reporting a crash in the telemetry reporter
* @daydayup863 for reporting issue with remote explain
RafiaSabih added a commit to RafiaSabih/timescaledb that referenced this pull request Apr 8, 2022
This release is patch release. We recommend that you upgrade at the next available opportunity.

**Bugfixes**
* timescale#3974 Fix remote EXPLAIN with parameterized queries
* timescale#4122 Fix segfault on INSERT into distributed hypertable
* timescale#4142 Ignore invalid relid when deleting hypertable
* timescale#4159 Fix ADD COLUMN IF NOT EXISTS error on compressed hypertable
* timescale#4161 Fix memory handling during scans
* timescale#4186 Fix owner change for distributed hypertable
* timescale#4192 Abort sessions after extension reload
* timescale#4193 Fix relcache callback handling causing crashes

**Thanks**
* @abrownsword for reporting a crash in the telemetry reporter
* @daydayup863 for reporting issue with remote explain
RafiaSabih added a commit that referenced this pull request Apr 8, 2022
This release is patch release. We recommend that you upgrade at the next available opportunity.

**Bugfixes**
* #3974 Fix remote EXPLAIN with parameterized queries
* #4122 Fix segfault on INSERT into distributed hypertable
* #4142 Ignore invalid relid when deleting hypertable
* #4159 Fix ADD COLUMN IF NOT EXISTS error on compressed hypertable
* #4161 Fix memory handling during scans
* #4186 Fix owner change for distributed hypertable
* #4192 Abort sessions after extension reload
* #4193 Fix relcache callback handling causing crashes

**Thanks**
* @abrownsword for reporting a crash in the telemetry reporter
* @daydayup863 for reporting issue with remote explain
RafiaSabih added a commit to RafiaSabih/timescaledb that referenced this pull request Apr 8, 2022
This release is patch release. We recommend that you upgrade at the next available opportunity.

**Bugfixes**
* timescale#4121 Fix RENAME TO/SET SCHEMA on distributed hypertable
* timescale#4122 Fix segfault on INSERT into distributed hypertable
* timescale#4142 Ignore invalid relid when deleting hypertable
* timescale#4159 Fix ADD COLUMN IF NOT EXISTS error on compressed hypertable
* timescale#4161 Fix memory handling during scans
* timescale#4176 Fix remote EXPLAIN with parameterized queries
* timescale#4181 Fix spelling errors and omissions
* timescale#4186 Fix owner change for distributed hypertable
* timescale#4192 Abort sessions after extension reload
* timescale#4193 Fix relcache callback handling causing crashes
* timescale#4199 Remove signal-unsafe calls from signal handlers
* timescale#4219 Do not modify aggregation state in finalize

**Thanks**
* @abrownsword for reporting a crash in the telemetry reporter
* @daydayup863 for reporting issue with remote explain
RafiaSabih added a commit to RafiaSabih/timescaledb that referenced this pull request Apr 11, 2022
This release is patch release. We recommend that you upgrade at the next available opportunity.

**Bugfixes**
* timescale#4121 Fix RENAME TO/SET SCHEMA on distributed hypertable
* timescale#4122 Fix segfault on INSERT into distributed hypertable
* timescale#4142 Ignore invalid relid when deleting hypertable
* timescale#4159 Fix ADD COLUMN IF NOT EXISTS error on compressed hypertable
* timescale#4161 Fix memory handling during scans
* timescale#4176 Fix remote EXPLAIN with parameterized queries
* timescale#4181 Fix spelling errors and omissions
* timescale#4186 Fix owner change for distributed hypertable
* timescale#4192 Abort sessions after extension reload
* timescale#4193 Fix relcache callback handling causing crashes
* timescale#4199 Remove signal-unsafe calls from signal handlers
* timescale#4219 Do not modify aggregation state in finalize

**Thanks**
* @abrownsword for reporting a crash in the telemetry reporter
* @daydayup863 for reporting issue with remote explain
RafiaSabih added a commit to RafiaSabih/timescaledb that referenced this pull request Apr 11, 2022
This release is patch release. We recommend that you upgrade at the next available opportunity.

**Bugfixes**
* timescale#4121 Fix RENAME TO/SET SCHEMA on distributed hypertable
* timescale#4122 Fix segfault on INSERT into distributed hypertable
* timescale#4142 Ignore invalid relid when deleting hypertable
* timescale#4159 Fix ADD COLUMN IF NOT EXISTS error on compressed hypertable
* timescale#4161 Fix memory handling during scans
* timescale#4176 Fix remote EXPLAIN with parameterized queries
* timescale#4181 Fix spelling errors and omissions
* timescale#4186 Fix owner change for distributed hypertable
* timescale#4192 Abort sessions after extension reload
* timescale#4193 Fix relcache callback handling causing crashes
* timescale#4199 Remove signal-unsafe calls from signal handlers
* timescale#4219 Do not modify aggregation state in finalize

**Thanks**
* @abrownsword for reporting a crash in the telemetry reporter
* @daydayup863 for reporting issue with remote explain
RafiaSabih added a commit to RafiaSabih/timescaledb that referenced this pull request Apr 11, 2022
This release is patch release. We recommend that you upgrade at the next available opportunity.

**Bugfixes**
* timescale#4121 Fix RENAME TO/SET SCHEMA on distributed hypertable
* timescale#4122 Fix segfault on INSERT into distributed hypertable
* timescale#4142 Ignore invalid relid when deleting hypertable
* timescale#4159 Fix ADD COLUMN IF NOT EXISTS error on compressed hypertable
* timescale#4161 Fix memory handling during scans
* timescale#4176 Fix remote EXPLAIN with parameterized queries
* timescale#4181 Fix spelling errors and omissions
* timescale#4186 Fix owner change for distributed hypertable
* timescale#4192 Abort sessions after extension reload
* timescale#4193 Fix relcache callback handling causing crashes
* timescale#4199 Remove signal-unsafe calls from signal handlers
* timescale#4219 Do not modify aggregation state in finalize

**Thanks**
* @abrownsword for reporting a crash in the telemetry reporter
* @daydayup863 for reporting issue with remote explain
svenklemm pushed a commit to RafiaSabih/timescaledb that referenced this pull request Apr 11, 2022
This release is a patch release. We recommend that you upgrade at the next available opportunity.

**Bugfixes**
* timescale#4121 Fix RENAME TO/SET SCHEMA on distributed hypertable
* timescale#4122 Fix segfault on INSERT into distributed hypertable
* timescale#4142 Ignore invalid relid when deleting hypertable
* timescale#4159 Fix ADD COLUMN IF NOT EXISTS error on compressed hypertable
* timescale#4161 Fix memory handling during scans
* timescale#4176 Fix remote EXPLAIN with parameterized queries
* timescale#4181 Fix spelling errors and omissions
* timescale#4186 Fix owner change for distributed hypertable
* timescale#4192 Abort sessions after extension reload
* timescale#4193 Fix relcache callback handling causing crashes
* timescale#4199 Remove signal-unsafe calls from signal handlers
* timescale#4219 Do not modify aggregation state in finalize

**Thanks**
* @abrownsword for reporting a crash in the telemetry reporter
* @daydayup863 for reporting issue with remote explain
svenklemm pushed a commit that referenced this pull request Apr 11, 2022
This release is a patch release. We recommend that you upgrade at the next available opportunity.

**Bugfixes**
* #4121 Fix RENAME TO/SET SCHEMA on distributed hypertable
* #4122 Fix segfault on INSERT into distributed hypertable
* #4142 Ignore invalid relid when deleting hypertable
* #4159 Fix ADD COLUMN IF NOT EXISTS error on compressed hypertable
* #4161 Fix memory handling during scans
* #4176 Fix remote EXPLAIN with parameterized queries
* #4181 Fix spelling errors and omissions
* #4186 Fix owner change for distributed hypertable
* #4192 Abort sessions after extension reload
* #4193 Fix relcache callback handling causing crashes
* #4199 Remove signal-unsafe calls from signal handlers
* #4219 Do not modify aggregation state in finalize

**Thanks**
* @abrownsword for reporting a crash in the telemetry reporter
* @daydayup863 for reporting issue with remote explain
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug tech-debt Needs refactoring and improvement tasks related to the source code and its architecture.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Bug]: Segmentation Fault when VACUUM FULL pg_class
4 participants