
Add missing lock to Constraint-aware append #7515

Open
erimatnor wants to merge 1 commit into main
Conversation

erimatnor
Contributor

In some cases, a relation was missing a lock when calling relation_excluded_by_constraints(), hitting an assertion. An AccessShareLock or higher must be held on the relation.

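The PR does not quote the exact assertion, but the lock check that trips in assert-enabled builds is most likely the one in PostgreSQL's relation_open(), paraphrased below from src/backend/access/common/relation.c; treating that as the failing spot is an assumption, not something stated in the PR.

```c
/* Paraphrased from PostgreSQL's relation_open(); shown for context only.
 * When a relation is opened with NoLock, the backend must already hold
 * AccessShareLock or stronger on it, otherwise this assertion fires. */
Assert(lockmode != NoLock ||
	   IsBootstrapProcessingMode() ||
	   CheckRelationLockedByMe(r, AccessShareLock, true));
```
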
@erimatnor erimatnor added the bug label Dec 4, 2024
@erimatnor
Contributor Author

erimatnor commented Dec 4, 2024

Unfortunately, the assertion failure is not easy to reproduce in a test because it doesn't happen every time; it typically requires a clean-slate session. I might spend some more time writing a test for it later.


codecov bot commented Dec 4, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 82.19%. Comparing base (59f50f2) to head (01b22dd).
Report is 640 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #7515      +/-   ##
==========================================
+ Coverage   80.06%   82.19%   +2.12%     
==========================================
  Files         190      230      +40     
  Lines       37181    43192    +6011     
  Branches     9450    10869    +1419     
==========================================
+ Hits        29770    35501    +5731     
- Misses       2997     3365     +368     
+ Partials     4414     4326      -88     


@@ -78,6 +79,8 @@ can_exclude_chunk(PlannerInfo *root, EState *estate, Index rt_index, List *restr
 {
 	RangeTblEntry *rte = rt_fetch(rt_index, estate->es_range_table);
 
+	/* Need AccessShareLock to call excluded_by_constraint() */
+	LockRelationOid(rte->relid, AccessShareLock);
Member

The fact that it's not locked in this place looks like a bug. All range table entries are supposed to be locked in AccessShareLock mode very early; in upstream inheritance this is done in expand_inherited_rtentry() -> find_inheritance_children(). Did we somehow miss the lock in our chunk expansion code?
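For reference, a minimal sketch of the upstream behavior described above, paraphrased from PostgreSQL's catalog/pg_inherits.c: find_inheritance_children() locks every child OID with the requested lockmode (AccessShareLock during planner expansion) before handing it back to the caller. The sketch below takes an already-collected child list and skips the pg_inherits scan and OID sorting, so the structure is simplified rather than copied from the actual source.

```c
#include "postgres.h"

#include "nodes/pg_list.h"
#include "storage/lmgr.h"
#include "utils/syscache.h"

/* Simplified sketch of find_inheritance_children(): lock each child with
 * the requested lockmode before anything else gets to look at it. */
static List *
lock_inheritance_children_sketch(List *raw_children, LOCKMODE lockmode)
{
	List	   *result = NIL;
	ListCell   *lc;

	foreach(lc, raw_children)
	{
		Oid			child_oid = lfirst_oid(lc);

		if (lockmode != NoLock)
		{
			/* Take the lock first, before the child is used anywhere. */
			LockRelationOid(child_oid, lockmode);

			/*
			 * The child may have been dropped between the catalog scan and
			 * the lock acquisition; release the useless lock and skip it.
			 */
			if (!SearchSysCacheExists1(RELOID, ObjectIdGetDatum(child_oid)))
			{
				UnlockRelationOid(child_oid, lockmode);
				continue;
			}
		}

		result = lappend_oid(result, child_oid);
	}

	return result;
}
```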

Contributor Author

Yes, it looks like it. You might be right that this could be a bigger issue with our expansion code, so I will dig a bit deeper and see what I can find.

@erimatnor
Contributor Author

Here's a script that reproduces the assertion crash:

select setseed(0.2);

create table readings(time timestamptz, device int, temp float);

select create_hypertable('readings', 'time', create_default_indexes => false);

insert into readings
select t, ceil(random()*10), random()*40
from generate_series('2022-06-01'::timestamptz, '2022-06-20 00:01:00', '1s') t;

alter table readings set (
      timescaledb.compress,
      timescaledb.compress_orderby = 'time',
      timescaledb.compress_segmentby = 'device'
);

--insert into readings values ('2022-06-01', 1, 1.0), ('2022-06-02', 2, 2.0), ('2022-08-02', 3, 3.0);

create index on readings (time);

select format('%I.%I', chunk_schema, chunk_name)::regclass as chunk
  from timescaledb_information.chunks
 where format('%I.%I', hypertable_schema, hypertable_name)::regclass = 'readings'::regclass
 limit 1 \gset

set timescaledb.enable_chunk_append=off;

select compress_chunk(ch) from show_chunks('readings') ch;
select decompress_chunk('_timescaledb_internal._hyper_1_3_chunk');

--set timescaledb.enable_chunk_append=on;

explain (analyze)
select time, avg(temp), device from readings
where time > now() - interval '2 years 5 months 20 days' group by time, device order by time;

@erimatnor
Contributor Author

erimatnor commented Dec 10, 2024

@akuzm After some more investigation it seems this is related to parallel query plans, where the spawned workers don't have the relation locked. I don't know if parallel workers are expected to have relations referenced in plan nodes locked at the beginning of plan execution.

@akuzm
Member

akuzm commented Dec 10, 2024

> @akuzm After some more investigation it seems this is related to parallel query plans, where the spawned workers don't have the relation locked. I don't know if parallel workers are expected to have relations referenced in plan nodes locked at the beginning of plan execution.

Not sure, e.g. parallel seq scans seemingly don't lock tables in the workers, although I only looked through the code and didn't check. In ChunkAppend, we ultimately switched to performing chunk exclusion in the parallel leader. Maybe it makes sense to do the same here as well. #5857

@erimatnor
Contributor Author

> @akuzm After some more investigation it seems this is related to parallel query plans, where the spawned workers don't have the relation locked. I don't know if parallel workers are expected to have relations referenced in plan nodes locked at the beginning of plan execution.

> Not sure, e.g. parallel seq scans seemingly don't lock tables in the workers, although I only looked through the code and didn't check. In ChunkAppend, we ultimately switched to performing chunk exclusion in the parallel leader. Maybe it makes sense to do the same here as well. #5857

If we can do chunk exclusion only once, that would be better. But TBH, I am not sure it is worth the effort given how seldom this scan node is used, so for now I think just taking the lock in the worker is enough. We can always implement leader-side exclusion later.

@erimatnor erimatnor requested a review from akuzm December 20, 2024 08:09
@erimatnor
Contributor Author

@akuzm Are we good to proceed with this fix? The lack of locking seems to occur only in parallel queries, so it seems correct to grab the lock if it is not already taken. It also seems like a very low-risk fix.

@akuzm
Member

akuzm commented Dec 23, 2024

> @akuzm Are we good to proceed with this fix? The lack of locking seems to occur only in parallel queries, so it seems correct to grab the lock if it is not already taken. It also seems like a very low-risk fix.

Sure. Let's move the lock before constify_restrictinfos; that's a complicated function that is usually called during planning, and it might have some obscure path that requires a lock as well.
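A rough sketch of that reordering follows. The surrounding control flow and the exact argument lists are approximations (in particular the call shape of constify_restrictinfos()); only can_exclude_chunk(), rt_fetch(), LockRelationOid() and AccessShareLock are taken from the diff above.

```c
/* Hypothetical ordering only -- not the actual TimescaleDB code. The point
 * is that the chunk relation is locked before constify_restrictinfos() or
 * any other planner-style code inspects it. */
RangeTblEntry *rte = rt_fetch(rt_index, estate->es_range_table);

/* Lock first: AccessShareLock (or stronger) must be held before the
 * relation is examined, including inside constify_restrictinfos(). */
LockRelationOid(rte->relid, AccessShareLock);

/* Argument list approximated; see the actual definition in the source. */
restrictinfos = constify_restrictinfos(root, restrictinfos);

if (can_exclude_chunk(root, estate, rt_index, restrictinfos))
{
	/* ... this chunk can be skipped entirely ... */
}
```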
