
Fix ERROR when acquiring a tuple lock on OSM chunks on replica #7035

Merged: zilder merged 1 commit into timescale:main from zilder/fix_replica_osm_tuple_lock on Jun 18, 2024

Conversation

@zilder (Contributor) commented Jun 17, 2024

The planner attempts to acquire a `LockTupleKeyShare` tuple lock on an OSM chunk's dimension slice, which results in an error on the replica. This patch fixes the issue by skipping the tuple lock when in recovery mode.

Fix for timescale/Support-Dev-Collab#1847

Disable-check: force-changelog-file
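
As a rough sketch of the approach described above (not the exact patch): the tuple-lock descriptor is only passed to the scan when the server is not in recovery. The `ScanTupLock` setup mirrors the diff quoted later in this thread; `scan_dimension_slice` is a hypothetical stand-in for the real scan helper.

```c
#include "postgres.h"
#include "access/xlog.h"       /* RecoveryInProgress() */
#include "nodes/lockoptions.h" /* LockTupleMode, LockWaitBlock */

/*
 * Sketch only: scan_dimension_slice() is a hypothetical stand-in for the
 * slice scan performed inside ts_chunk_get_osm_slice_and_lock().
 */
static DimensionSlice *
get_osm_slice_sketch(int32 osm_chunk_id, int32 time_dim_id,
					 LockTupleMode tuplockmode)
{
	ScanTupLock tuplock = {
		.lockmode = tuplockmode,
		.waitpolicy = LockWaitBlock,
	};

	/*
	 * Tuple locks require assigning a transaction id, and transaction ids
	 * cannot be created during recovery, so on a read-only replica we scan
	 * without taking the tuple lock.
	 */
	return scan_dimension_slice(osm_chunk_id, time_dim_id,
								RecoveryInProgress() ? NULL : &tuplock);
}
```

On a hot standby the slice lookup itself still works; only the key-share lock is skipped, which is presumably safe because a standby cannot take local writes in the first place.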

codecov bot commented Jun 17, 2024

Codecov Report

Attention: Patch coverage is 0% with 2 lines in your changes missing coverage. Please review.

Project coverage is 81.74%. Comparing base (59f50f2) to head (673b7d4).
Report is 212 commits behind head on main.

| Files | Patch % | Lines |
|---|---|---|
| src/hypertable.c | 0.00% | 0 Missing and 2 partials ⚠️ |
Additional details and impacted files
```
@@            Coverage Diff             @@
##             main    #7035      +/-   ##
==========================================
+ Coverage   80.06%   81.74%   +1.67%
==========================================
  Files         190      199       +9
  Lines       37181    37011     -170
  Branches     9450     9676     +226
==========================================
+ Hits        29770    30253     +483
+ Misses       2997     2859     -138
+ Partials     4414     3899     -515
```


@zilder marked this pull request as ready for review June 17, 2024 12:07
@zilder requested a review from nikkhils June 17, 2024 12:08
@mkindahl (Contributor) left a comment


The commit description is not very clear. It does fix an error, but a better description would be a functional one about what the pull request actually does, something along the lines of "Do not lock tuples in OSM chunks on replica servers".

It would also be helpful to add comments describing why, so that your future self knows why you added it.

src/hypertable.c (outdated)

```diff
@@ -2560,6 +2563,7 @@ ts_hypertable_osm_range_update(PG_FUNCTION_ARGS)
 	bool overlap = false, range_invalid = false;

 	/* Lock tuple FOR UPDATE */
+	Assert(!RecoveryInProgress());
```
Contributor:

The assert cannot be used as a check in production deployments and is no guarantee that it will never be reached.

What happens if we run this code on a secondary that is in recovery? Will that cause a production issue?

If so, we should have a check in place so that the tuple is not locked when the server is in recovery, but lock it otherwise.

@zilder (Contributor, Author) commented Jun 18, 2024

Thanks for the review! I changed the commit message and added a comment.

> The assert cannot be used as a check in production deployments and is no guarantee that it will never be reached.

It wasn't meant as a runtime check, rather as documentation for myself and other devs. (Production images are not compiled with asserts enabled, right?) I added a comment there for clarity. This function would not normally be reached from the OSM code: on a replica it would fail before this call. But even when called directly from psql, it would not crash; it would fail with the same error as in the original issue:

    ERROR:  cannot assign TransactionIds during recovery

when trying to update the catalog.

I can remove the assert altogether, though. I only added it because this function calls `ts_chunk_get_osm_slice_and_lock` and that should not happen in recovery mode, so we can catch it early if it does.
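
A note on the assert semantics under discussion: PostgreSQL's `Assert` expands to a no-op unless the server is built with assertion checking enabled (`--enable-cassert`), so in a typical release build it adds no runtime guard. A minimal sketch of the pattern; the wrapper function is hypothetical, for illustration only:

```c
#include "postgres.h"
#include "access/xlog.h" /* RecoveryInProgress() */

/* Hypothetical wrapper, for illustration only. */
static void
entry_check_sketch(void)
{
	/*
	 * In builds without USE_ASSERT_CHECKING (the usual production
	 * configuration), Assert() compiles to nothing, so this line is a
	 * debug-build tripwire and a statement of intent, not a runtime
	 * guard. If the function were still reached on a replica, the
	 * subsequent catalog update would fail with "cannot assign
	 * TransactionIds during recovery" rather than crash.
	 */
	Assert(!RecoveryInProgress());
}
```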

Contributor replied:

> Thanks for the review! I changed the commit message and added a comment.
>
> > The assert cannot be used as a check in production deployments and is no guarantee that it will never be reached.
>
> It wasn't meant as a runtime check, rather as documentation for myself and other devs. (Production images are not compiled with asserts enabled, right?) I added a comment there for clarity. This function would not normally be reached from the OSM code: on a replica it would fail before this call. But even when called directly from psql, it would not crash; it would fail with the same error as in the original issue, `ERROR: cannot assign TransactionIds during recovery`, when trying to update the catalog.

Thank you for explaining it.

> I can remove the assert altogether, though. I only added it because this function calls `ts_chunk_get_osm_slice_and_lock` and that should not happen in recovery mode, so we can catch it early if it does.

No, keeping the asserts is good for documentation purposes, declaring intentions, and helping debugging, as you mentioned above. I was just unsure whether this meant that the code after it would execute in production builds, causing havoc. If that cannot happen, a comment on why, plus a test (if possible), is good enough to ensure that we do not try to execute bad code in production builds.

@zilder force-pushed the zilder/fix_replica_osm_tuple_lock branch from 97ca75d to 35380e7 on June 18, 2024 10:08
src/hypertable.c (outdated), comment on lines 2511 to 2524:

```c
/* this function is not meant to be run on a read-only secondary */
Assert(!RecoveryInProgress());
```

@mkindahl (Contributor) commented Jun 18, 2024

For your future self, could you extend the comment to describe why we can't reach this point in release builds?

@zilder (Contributor, Author) replied:

Thanks! Will do

@zilder force-pushed the zilder/fix_replica_osm_tuple_lock branch from 35380e7 to b9566b6 on June 18, 2024 13:15
```diff
@@ -2461,14 +2461,23 @@ ts_chunk_get_osm_slice_and_lock(int32 osm_chunk_id, int32 time_dim_id, LockTuple
 		.lockmode = tuplockmode,
 		.waitpolicy = LockWaitBlock,
 	};
+	/*
+	 * We cannot acquire a tuple lock when running in recovery mode
```
Contributor:

Suggested change:

```diff
-	 * We cannot acquire a tuple lock when running in recovery mode
+	 * It is not possible to acquire a tuple lock when running in recovery
+	 * mode, as tuple locks require a transaction id and transaction ids
+	 * cannot be created in recovery mode, so we only acquire the tuple
+	 * lock if we are not in recovery mode.
```
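
For illustration, here is roughly how the suggested wording might sit in the code from the hunk above (a fragment, not the actual committed change; the scan call that consumes the lock is elided):

```c
ScanTupLock tuplock = {
	.lockmode = tuplockmode,
	.waitpolicy = LockWaitBlock,
};

/*
 * It is not possible to acquire a tuple lock when running in recovery
 * mode: tuple locks require a transaction id, and transaction ids cannot
 * be created in recovery mode. We therefore only pass the lock descriptor
 * when the server is not in recovery.
 */
const ScanTupLock *tuplock_ptr = RecoveryInProgress() ? NULL : &tuplock;
```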

Contributor:

Added some more detail to the comment to explain why tuple locks cannot be acquired in recovery mode.

@zilder (Contributor, Author):

Added a clarification

The planner attempts to acquire a `LockTupleKeyShare` tuple lock on an OSM
chunk's dimension slice, which results in an error on the replica. This
patch skips the tuple lock when in recovery mode.
@zilder force-pushed the zilder/fix_replica_osm_tuple_lock branch from b9566b6 to 673b7d4 on June 18, 2024 14:27
@zilder requested a review from gayyappan June 18, 2024 14:28
@zilder merged commit f68006b into timescale:main Jun 18, 2024
38 of 40 checks passed
@gayyappan added this to the TimescaleDB 2.15.3 milestone Jun 29, 2024
pallavisontakke added a commit to pallavisontakke/timescaledb that referenced this pull request Jul 2, 2024
For PRs timescale#7088 and timescale#7035
Some more edits for docs compliance.
@pallavisontakke mentioned this pull request Jul 2, 2024
pallavisontakke added a commit that referenced this pull request Jul 2, 2024
@pallavisontakke mentioned this pull request Jul 2, 2024
svenklemm added a commit that referenced this pull request Jul 2, 2024
This release contains bug fixes since the 2.15.2 release.
Best practice is to upgrade at the next available opportunity.

**Migrating from self-hosted TimescaleDB v2.14.x and earlier**

After you run `ALTER EXTENSION`, you must run [this SQL
script](https://github.com/timescale/timescaledb-extras/blob/master/utils/2.15.X-fix_hypertable_foreign_keys.sql).
For more details, see the following pull request
[#6797](#6797).

If you are migrating from TimescaleDB v2.15.0, v2.15.1 or v2.15.2, no
changes are required.

**Bugfixes**
* #7061: Fix the handling of multiple unique indexes in a compressed
INSERT.
* #7080: Fix the `corresponding equivalence member not found` error.
* #7088: Fix the leaks in the DML functions.
* #7035: Fix the error when acquiring a tuple lock on the OSM chunks on
the replica.
* #7091: Fix the `ORDER BY/GROUP BY expression not found in targetlist` error
on PG16.

**Thanks**
* @Kazmirchuk for reporting the issue about the leaks in the DML functions.

---------

Signed-off-by: Sven Klemm <31455525+svenklemm@users.noreply.github.com>
Co-authored-by: Sven Klemm <31455525+svenklemm@users.noreply.github.com>