state_groups_state uses a large sequential scan #5064
Comments
Thanks! We'll take a look at that. |
I'm really surprised that this makes a difference. Obviously I don't have identical data, but equivalent queries are approximately instant. The query plan should look like:
@grinapo: could you confirm a few things for me:
|
Dropping the index now gives:
|
I am at a loss to explain why postgres has decided to seq scan rather than index scan :/ |
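One quick way to check whether the planner will take an index path at all is to disable sequential scans for the session and re-run EXPLAIN. A minimal sketch, assuming the simplified lookup below stands in for the real state-group query (the state_group value is a placeholder):

SET enable_seqscan = off;          -- session-level experiment only

EXPLAIN (ANALYZE, BUFFERS)
SELECT type, state_key, event_id
FROM state_groups_state
WHERE state_group = 12345;         -- placeholder: use a real state_group id

RESET enable_seqscan;              -- restore the default afterwards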
Here's an
|
Forcing
|
And the specific timings with the index, again:
|
Without the index the specific timings:
|
Not exactly the 60-second version, but longer than 20 ms:
|
It seems the stats sample has to be raised. Maybe this applies to other tables.... 🙄 |
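For reference, raising the per-column sample size and re-analysing is one way to do that; a minimal sketch, assuming state_group is the column with the skewed estimate:

ALTER TABLE state_groups_state
    ALTER COLUMN state_group SET STATISTICS 5000;   -- default target is 100

ANALYZE state_groups_state;   -- rebuild statistics with the larger sample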
I just caught a nearly identical query running for a long period, simultaneously with synapse 1.0 memory use exploding by a factor of 4 or so:

WITH RECURSIVE state(state_group) AS (
    VALUES(3916371::bigint)
    UNION ALL
    SELECT prev_state_group FROM state_group_edges e, state s
    WHERE s.state_group = e.state_group
)
SELECT DISTINCT
    type,
    state_key,
    last_value(event_id) OVER (
        PARTITION BY type, state_key
        ORDER BY state_group ASC
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
    ) AS event_id
FROM state_groups_state
WHERE state_group IN ( SELECT state_group FROM state )
  AND ((type = 'm.room.history_visibility' AND state_key = '') OR (type = 'm.room.member'));

However, the query plan does not seem to involve any sequential scans:
|
Yup, I think the postgres statistics are wrong for some people using the default postgres sampling sizes. This is more of a DB ops problem than an app problem, so I don't really know what Synapse can do about it. The options as far as I can see are:
|
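The list of options above is not preserved here; one plausible DB-ops mitigation (an assumption on my part, not necessarily what was being proposed) is to raise the statistics target for the whole database rather than a single column:

-- "synapse" is a placeholder for the actual Synapse database name.
ALTER DATABASE synapse SET default_statistics_target = 500;

-- The new target applies to new sessions; reconnect and then re-analyse.
ANALYZE;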
I was affected by very similar symptoms, preventing me from logging in at all. I created the index proposed in the opening post and I can now log in. Before:
with the index:
|
I still have a problem with:

WITH RECURSIVE state(state_group) AS (
    VALUES(17434::bigint)
    UNION ALL
    SELECT prev_state_group FROM state_group_edges e, state s
    WHERE s.state_group = e.state_group
)
SELECT DISTINCT ON (type, state_key) type, state_key, event_id
FROM state_groups_state
WHERE state_group IN ( SELECT state_group FROM state )
  AND ((type = 'm.room.member'))
ORDER BY type, state_key, state_group DESC;

PostgreSQL overestimates the number of rows in the CTE (here, only two).
|
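The misestimate can be seen directly by comparing the planner's estimated row count with the actual count for the recursive CTE; a minimal sketch using the same CTE as above:

EXPLAIN (ANALYZE, BUFFERS)
WITH RECURSIVE state(state_group) AS (
    VALUES(17434::bigint)
    UNION ALL
    SELECT prev_state_group FROM state_group_edges e, state s
    WHERE s.state_group = e.state_group
)
SELECT count(*) FROM state;
-- Compare "rows=…" (the estimate) with "actual … rows=…" in the CTE node.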
I had more luck with this patch:

diff --git a/synapse/storage/databases/state/bg_updates.py b/synapse/storage/databases/state/bg_updates.py
index 139085b67..2b426b660 100644
--- a/synapse/storage/databases/state/bg_updates.py
+++ b/synapse/storage/databases/state/bg_updates.py
@@ -88,6 +88,7 @@ class StateGroupBackgroundUpdateStore(SQLBaseStore):
             # Temporarily disable sequential scans in this transaction. This is
             # a temporary hack until we can add the right indices in
             txn.execute("SET LOCAL enable_seqscan=off")
+            txn.execute("SET LOCAL enable_bitmapscan=off")
             # The below query walks the state_group tree so that the "state"
             # table includes all state_groups in the tree. It then joins
|
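For context on what the patch does: SET LOCAL only lasts until the end of the enclosing transaction, so the planner overrides cannot leak into other queries. A standalone illustration:

BEGIN;
SET LOCAL enable_seqscan = off;     -- only affects planning in this transaction
SET LOCAL enable_bitmapscan = off;
-- ... run the state-group query here ...
COMMIT;   -- both settings revert automatically at COMMIT or ROLLBACK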
I think what is happening here is that if the server is in very large rooms, the statistics collector ends up thinking that there are few distinct state groups, and so when we come to query e.g. 50 of them the query planner ends up thinking that "50 state groups is a majority of the table, so we may as well scan the entire table". It would help if people who run into this can report what the following returns:

SELECT attname, inherited, n_distinct, correlation
FROM pg_stats
WHERE tablename = 'state_groups_state';

The n_distinct for state_group should either be very large, or negative. Two solutions here are:
|
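The two solutions are not preserved in this copy of the thread; one option that addresses a bad n_distinct estimate directly (my assumption, not necessarily one of the two) is to pin the value by hand:

-- A negative n_distinct is a fraction of the row count: -0.02 means roughly
-- one distinct state_group per 50 rows. The value here is illustrative only.
ALTER TABLE state_groups_state
    ALTER COLUMN state_group SET (n_distinct = -0.02);

ANALYZE state_groups_state;   -- the override takes effect at the next ANALYZE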
Actually, I'm not sure that is true. It may be true for unique indices, and creating the index likely updates the |
First of all, it looks like I'm not affected by the slowness anymore, even after dropping the patch mentioned above. About your query, here are my numbers:
|
#10359 made a night/day difference to my disk IO and CPU usage. Seems like that fixed the problem for me. |
A note for others finding this problem: somebody else had ended up (via a mistake with …) with … (likewise, a loop in state_group_edges would be a massive problem) |
Description
I have caught a query which runs several times, recursively, when fetching room history, and therefore takes a significant amount of time (tens of minutes) to scroll up.
The query can be sped up by …
Without it the planner says:
And with it it's
which is 175 times faster. ;-)
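The index definition and the two query plans did not survive in this copy of the report. As a purely hypothetical illustration of the kind of before/after comparison described, assuming an index covering the columns the query filters on:

-- Hypothetical index shape, not necessarily the one from the original report.
CREATE INDEX CONCURRENTLY state_groups_state_sg_type_sk
    ON state_groups_state (state_group, type, state_key);

-- Timings with and without the index can then be compared with:
EXPLAIN (ANALYZE, BUFFERS)
SELECT type, state_key, event_id
FROM state_groups_state
WHERE state_group = 17434
  AND type = 'm.room.member';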
Version information
If not matrix.org: matrix.grin.hu
Version: 0.99.2
Install method: debian pkg + workers
Platform: debian linux (testing)