Sync workers get stuck, @cached call blocked by slow @cachedList query #14049

Fizzadar · 2022-10-04T15:40:57Z

Hard to write a title, feel free to change! We're facing a problem where a sync worker stops processing requests entirely, blocked by a call to get_rooms_for_users(_with_stream_ordering). I think what is happening is roughly:

A sync request comes in that is very far behind, which triggers a huge device list update
This then calls get_rooms_for_users around here
This then blocks all calls to get_rooms_for_user, which is used as part of the regular sync

It seems certain combinations of cache invalidation and requests mean every user is included in the cached list call, which in turn blocks all sync requests on that instance until it clears.
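To make the suspected mechanism concrete, here is a toy asyncio model of it. This is only a sketch under assumptions: `ToyCache`, `get_many` and `get_one` are hypothetical names and this is not Synapse's cache implementation. It assumes the bulk lookup registers a pending entry for every key before its query runs, and that single-key lookups wait on whatever entry already exists.

```python
import asyncio


class ToyCache:
    """Toy model: one pending future per key, shared by bulk and single lookups."""

    def __init__(self):
        self._entries = {}  # key -> asyncio.Future

    async def get_many(self, keys, bulk_fetch):
        loop = asyncio.get_running_loop()
        missing = [k for k in keys if k not in self._entries]
        # Register a pending entry for every missing key before the query runs.
        for k in missing:
            self._entries[k] = loop.create_future()
        if missing:
            results = await bulk_fetch(missing)  # one slow query for all keys
            for k in missing:
                self._entries[k].set_result(results[k])
        return {k: await self._entries[k] for k in keys}

    async def get_one(self, key, fetch_one):
        if key not in self._entries:
            self._entries[key] = asyncio.get_running_loop().create_future()
            self._entries[key].set_result(await fetch_one(key))
        # If a bulk call already registered this key, we wait on its future.
        return await self._entries[key]


async def demo():
    cache = ToyCache()

    async def slow_bulk(keys):
        await asyncio.sleep(0.2)  # stands in for the minutes-long query
        return {k: f"rooms-of-{k}" for k in keys}

    async def fast_single(key):
        return f"rooms-of-{key}"

    bulk = asyncio.create_task(cache.get_many(["@a:hs", "@b:hs"], slow_bulk))
    await asyncio.sleep(0)  # let the bulk call register its pending entries
    loop = asyncio.get_running_loop()
    start = loop.time()
    value = await cache.get_one("@a:hs", fast_single)
    waited = loop.time() - start
    await bulk
    return value, waited
```

In this model the single lookup for `@a:hs` never runs its own fast query; it waits for the whole bulk query to finish, which matches the behaviour described above if the real caches share pending entries in a similar way.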

The queries are taking minutes to run (and the database is not at max throughput), for example:

SELECT c.state_key, room_id, e.instance_name, e.stream_ordering
FROM current_state_events AS c
INNER JOIN events AS e USING (room_id, event_id)
WHERE
    c.type = 'm.room.member'
    AND c.membership = 'join'
    AND c.state_key = ANY(ARRAY['... huge array of thousands of user IDs ...'])
(1 row)

Some thoughts:

should calls to @cached block if there's a relevant @cachedList call ongoing?
if so, how do we keep the critical sync path from being blocked as above?
should the calls to get_rooms_for_users be batched, currently unbounded?
should we reject sync requests that are super old and force a re-init?

Related: #14037

DMRobertson · 2022-10-06T15:47:40Z

should the calls to get_rooms_for_users be batched, currently unbounded?

Very probably. It sounds like we need to use batch_iter/chunk_seq or similar to break this up into smaller chunks.
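A minimal sketch of what that batching could look like, under stated assumptions: the `batch_iter` below is a local stand-in written for this example rather than the project's own helper, `get_rooms_for_users_batched` and `fetch_batch` are hypothetical names, and the batch size of 1000 is an arbitrary illustrative value.

```python
from itertools import islice
from typing import Callable, Dict, FrozenSet, Iterable, Iterator, Tuple, TypeVar

T = TypeVar("T")


def batch_iter(iterable: Iterable[T], size: int) -> Iterator[Tuple[T, ...]]:
    """Yield tuples of at most `size` items (local stand-in for a chunking helper)."""
    it = iter(iterable)
    # iter(callable, sentinel) keeps calling until an empty tuple comes back.
    return iter(lambda: tuple(islice(it, size)), ())


def get_rooms_for_users_batched(
    user_ids: Iterable[str],
    fetch_batch: Callable[[Tuple[str, ...]], Dict[str, FrozenSet[str]]],
    batch_size: int = 1000,  # assumed value, not taken from the issue
) -> Dict[str, FrozenSet[str]]:
    """Run one bounded lookup per batch instead of one unbounded lookup."""
    results: Dict[str, FrozenSet[str]] = {}
    for batch in batch_iter(user_ids, batch_size):
        results.update(fetch_batch(batch))
    return results
```

The sketch is synchronous for brevity; the real storage calls are asynchronous. The point is only that each query sees a bounded number of user IDs, so no single lookup holds up everything else for minutes.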

DMRobertson · 2022-10-06T15:48:34Z

Out of interest, are you able to retrieve the query plan for one of the large queries with thousands of IDs?

Fizzadar · 2022-10-08T09:51:27Z

Yes! Since bringing in a466164 this is reduced to a single index lookup because the join is gone, but I think it still makes sense to use batch_iter.

Gather  (cost=1479.00..43813.18 rows=1 width=94)
  Workers Planned: 2
  ->  Nested Loop  (cost=479.00..42813.08 rows=1 width=94)
        ->  Parallel Bitmap Heap Scan on current_state_events c  (cost=478.30..27242.74 rows=4007 width=105)
              Recheck Cond: ((state_key = ANY ('{... all the user ids ...}'::text[])) AND (type = 'm.room.member'::text))
              Filter: (membership = 'join'::text)
              ->  Bitmap Index Scan on current_state_events_member_index  (cost=0.00..475.90 rows=17198 width=0)
                    Index Cond: (state_key = ANY ('{... all the user IDs ...}'::text[]))
        ->  Index Scan using events_event_id_key on events e  (cost=0.70..3.88 rows=1 width=108)
              Index Cond: (event_id = c.event_id)
              Filter: (c.room_id = room_id)

Fizzadar mentioned this issue Oct 8, 2022 in: Batch up calls to get_rooms_for_users #14109 (merged)

reivilibre closed this as completed in #14109 Oct 12, 2022
