[Bug]: Bad plan produced when using CAGG name but good plan produced when using hypertable name! #6286

Open
jvanns opened this issue Nov 6, 2023 · 6 comments

jvanns commented Nov 6, 2023

What type of bug is this?

Performance issue

What subsystems and features are affected?

Continuous aggregate, Query planner

What happened?

Well, this is an odd one for sure, and it's plagued me for weeks until I managed to narrow down where the problem lies. Although I cannot easily craft a full minimal reproduction for you, I can provide a straightforward query and the resulting plans. The crux of the issue appears to be some sort of disconnect between the CAGG name and the hypertable name. I have two CAGGs; let's call them A and B. A sources from the main 'raw' hypertable, and B sources from A, creating a hierarchical CAGG setup. CAGG A buckets at 15m intervals, B at 30m. The whole DB schema is created up-front via a .sql script, and the CAGGs are explicitly created as follows (see the sketch after this list):

materialized only
NO DATA
chunk interval set to 12h on both
compression enabled
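
A minimal sketch of that layout, assuming hypothetical names (a raw hypertable metrics with a ts timestamp column and a val metric, CAGGs metrics_15m and metrics_30m), since the real schema is obfuscated:

-- CAGG A: 15m buckets sourced from the raw hypertable
CREATE MATERIALIZED VIEW metrics_15m
WITH (timescaledb.continuous, timescaledb.materialized_only = true) AS
SELECT
   time_bucket(INTERVAL '15 minutes', ts) AS ts,
   SUM(val) AS val  -- placeholder aggregate; the real columns are obfuscated
FROM metrics
GROUP BY 1
WITH NO DATA;

-- CAGG B: 30m buckets sourced from A (hierarchical)
CREATE MATERIALIZED VIEW metrics_30m
WITH (timescaledb.continuous, timescaledb.materialized_only = true) AS
SELECT
   time_bucket(INTERVAL '30 minutes', ts) AS ts,
   SUM(val) AS val
FROM metrics_15m
GROUP BY 1
WITH NO DATA;

-- 12h chunk interval on each CAGG's materialization hypertable
SELECT set_chunk_time_interval(
   format('%I.%I', materialization_hypertable_schema,
                   materialization_hypertable_name)::regclass,
   INTERVAL '12 hours')
FROM timescaledb_information.continuous_aggregates
WHERE view_name IN ('metrics_15m', 'metrics_30m');

-- compression enabled on both
ALTER MATERIALIZED VIEW metrics_15m SET (timescaledb.compress = true);
ALTER MATERIALIZED VIEW metrics_30m SET (timescaledb.compress = true);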

Here are the links to the queries and their respective plans:

https://explain.dalibo.com/plan/eag931806e5b6h82#plan
https://explain.dalibo.com/plan/0c8fbe46e424142e#plan
https://explain.dalibo.com/plan/54ec723bca45e7f7#plan
https://explain.dalibo.com/plan/a21542ffb46acf57#plan

Basically, using the underlying materialized hypertable name produces the expected good plan (executing in milliseconds), whereas using the CAGG name we get a very odd 'bad' plan (executing in minutes). What's more, this appears to affect other queries too; the one provided is simply an easy way of reproducing it.
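
To make the shape of that concrete (the actual query is only visible in the linked plans, so max(ts) stands in here, and _materialized_hypertable_2 is a hypothetical backing-table name; the real one can be looked up in timescaledb_information.continuous_aggregates):

-- 'bad' plan: the query goes through the CAGG view
EXPLAIN (ANALYZE) SELECT max(ts) FROM metrics_30m;

-- 'good' plan: the exact same query against the CAGG's backing hypertable
EXPLAIN (ANALYZE) SELECT max(ts) FROM _timescaledb_internal._materialized_hypertable_2;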

TimescaleDB version affected

2.11

PostgreSQL version used

15.3

What operating system did you use?

CentOS 7.9

What installation method did you use?

Docker

What platform did you run on?

On prem/Self-hosted

Relevant log output and stack trace

See the plan links above.

How can we reproduce the bug?

Hopefully enough information is given above to start with. I'll try to provide an isolated minimal reproduction later.
jvanns added the bug label Nov 6, 2023

jvanns commented Nov 6, 2023

This could be related to #2841 somehow?

zilder commented Nov 17, 2023

Hi @jvanns,

Could you please share the definition of your cagg's materialized view (and, if possible, other details such as the schema definition, the cagg hierarchy definition, etc.)?

I've been trying to reproduce your use case. I've noticed that the slower one pulls all the rows first, then sorts them and runs the aggregation, while the fast one just picks the first row from the ChunkAppend node (ordered by ts DESC) and calls it a day. So something prevents the first one from using an ordered ChunkAppend.

I think I was able to reproduce the issue (see the example below). Please let me know if it is similar to your situation.

-- create and populate a hypertable
create table test (ts timestamp not null, device_id int, val float);
select create_hypertable('test', 'ts', chunk_time_interval=>'1 day'::interval);

insert into test
select
    generate_series('2023-11-01', '2023-11-15', '1 minute'::interval) as ts,
    (random() * 10)::int as device_id,
    random() * 10 as val;

-- create an explicitly ORDERED cagg
create materialized view test_15m
with (
    timescaledb.continuous,
    timescaledb.materialized_only = true
) as
select
    time_bucket('15 minute', "ts") as ts,
    min(val), max(val), avg(val)
from test
group by 1
order by 1;

-- a cagg without ORDER BY
create materialized view test_15m_unordered
with (
    timescaledb.continuous,
    timescaledb.materialized_only = true
) as
select
    time_bucket('15 minute', "ts") as ts,
    min(val), max(val), avg(val)
from test
group by 1;

And the query plans. The first one uses the matview with the explicit ORDER BY. It looks similar to your slow plan example (i.e. it also pulls all the rows first, then aggregates):

# explain (analyze, costs off) select max(ts) from test_15m;
                                                                              QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate (actual time=1.358..1.362 rows=1 loops=1)
   ->  Result (actual time=0.042..1.157 rows=1345 loops=1)
         ->  Custom Scan (ChunkAppend) on _materialized_hypertable_2 (actual time=0.035..0.877 rows=1345 loops=1)
               Order: _materialized_hypertable_2.ts
               ->  Index Only Scan Backward using _hyper_2_16_chunk__materialized_hypertable_2_ts_idx on _hyper_2_16_chunk (actual time=0.033..0.417 rows=768 loops=1)
                     Heap Fetches: 768
               ->  Index Only Scan Backward using _hyper_2_17_chunk__materialized_hypertable_2_ts_idx on _hyper_2_17_chunk (actual time=0.015..0.294 rows=577 loops=1)
                     Heap Fetches: 577
 Planning Time: 1.410 ms
 Execution Time: 1.439 ms
(10 rows)

The second one just picks the first row from the top of the ChunkAppend (ordered by ts DESC) and finishes:

# explain (analyze, costs off) select max(ts) from test_15m_unordered;
                                                                          QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------
 Result (actual time=0.054..0.056 rows=1 loops=1)
   InitPlan 1 (returns $0)
     ->  Limit (actual time=0.047..0.049 rows=1 loops=1)
           ->  Custom Scan (ChunkAppend) on _materialized_hypertable_3 (actual time=0.045..0.046 rows=1 loops=1)
                 Order: _materialized_hypertable_3.ts DESC
                 ->  Index Only Scan using _hyper_3_19_chunk__materialized_hypertable_3_ts_idx on _hyper_3_19_chunk (actual time=0.044..0.044 rows=1 loops=1)
                       Index Cond: (ts IS NOT NULL)
                       Heap Fetches: 1
                 ->  Index Only Scan using _hyper_3_18_chunk__materialized_hypertable_3_ts_idx on _hyper_3_18_chunk (never executed)
                       Index Cond: (ts IS NOT NULL)
                       Heap Fetches: 0
 Planning Time: 1.198 ms
 Execution Time: 0.111 ms
(13 rows)

jvanns commented Nov 20, 2023

Sure, with some obfuscation to hide any potentially sensitive schema naming etc., our CAGG definition looks like this:

CREATE MATERIALIZED VIEW foobar_metrics_15m
WITH (timescaledb.continuous) AS
SELECT
   -- Try to match order and names of source metrics table first
   time_bucket(INTERVAL '30 minutes', ts) AS ts,
   dev_id,
   loc_id,
   ctxt_id,
   SUM(num) AS num,
   SUM(size) AS size,
   (result = 0)::BOOLEAN AS success,
   ((flags & 2::SMALLINT) != 0)::BOOLEAN AS special,
   PERCENTILE_CONT(0.50) WITHIN GROUP(ORDER BY duration) AS duration,
   operation_class(code) AS function_class
FROM metrics
GROUP BY
   -- All bar the aggregates, obviously...
   1,2,3,4,7,8,10
ORDER BY
   -- Groupings
   1,2,3,4,10,
   -- Aggregates
   5 DESC, -- Num
   6 DESC, -- Size
   7 DESC, -- Success?
   8 DESC  -- Special?
WITH NO DATA;
ALTER MATERIALIZED VIEW foobar_metrics_15m SET (
   timescaledb.materialized_only = true
);
SELECT set_chunk_time_interval(
   materialization_hypertable_name::TEXT,
   INTERVAL '6 hours'
)
FROM timescaledb_information.continuous_aggregates
WHERE view_name = 'foobar_metrics_15m';

So you think it's more to do with the explicit ordering in the definition? Note that the peculiarity I reported was that querying via the CAGG view name gives the 'bad' plan, yet the exact same query against the underlying hypertable name, e.g. _materialized_hypertable_2 in your example above, produces the 'good' plan and executes in a far more acceptable time!

Thanks for looking into it and reporting back. Glad to see there is at least a reproducible discrepancy of sorts.

jvanns commented Apr 9, 2024

Just wanted to flag that this bug is still present and renders our CAGGs largely useless (result sets are never returned in a reasonable time). I've managed to confirm that the oddity does appear to arise from the use of ORDERing in the CAGG definition, as @zilder suspected and reproduced. I've attached two more plans, 'bad' and 'better', that demonstrate the plan divergence simply by swapping the CAGG name for the materialized hypertable name. Note that in the bad plan, for some reason, the ORDERing is applied for each column despite the query only ever filtering on the timestamp (the 1st column in the ORDERing). The better plan is definitely more acceptable and expected.

better-plan.txt
bad-plan.txt

jvanns commented Apr 11, 2024

I can confirm that this is indeed due to the use of an ORDER BY in the materialized view query definition. With it, queries always require a sort key covering all of the chosen fields. With it removed, the query is planned and executed more as you'd expect.
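
For anyone hitting the same thing, a sketch of the workaround this implies, reusing the hypothetical names from the sketch in the original report (note that dropping a parent CAGG with CASCADE also drops any hierarchical children, which would then need recreating):

-- Recreate the CAGG without the ORDER BY clause
DROP MATERIALIZED VIEW metrics_15m CASCADE;

CREATE MATERIALIZED VIEW metrics_15m
WITH (timescaledb.continuous, timescaledb.materialized_only = true) AS
SELECT
   time_bucket(INTERVAL '15 minutes', ts) AS ts,
   SUM(val) AS val
FROM metrics
GROUP BY 1  -- no ORDER BY: queries no longer force a multi-column sort key
WITH NO DATA;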

nckswt commented Aug 14, 2024

Just noting that we've had to disable realtime aggregates, seemingly because of this issue 😢
