Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Sync workers get stuck, @cached call blocked by slow @cachedList query #14049

Closed
Fizzadar opened this issue Oct 4, 2022 · 3 comments · Fixed by #14109
Closed

Sync workers get stuck, @cached call blocked by slow @cachedList query #14049

Fizzadar opened this issue Oct 4, 2022 · 3 comments · Fixed by #14109
Labels
A-Sync defects related to /sync O-Occasional Affects or can be seen by some users regularly or most users rarely S-Major Major functionality / product severely impaired, no satisfactory workaround. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues.

Comments

@Fizzadar
Copy link
Contributor

Fizzadar commented Oct 4, 2022

Hard to write title, feel free to change! We're facing a problem where a sync worker will stop processing requires entirely blocked by a call to get_rooms_for_users(_with_stream_ordering). I think what is happening is roughly:

  • Sync comes in which is very behind, triggers a huge device list update
  • This then calls get_rooms_for_users around here
  • This then blocks all calls to get_rooms_for_user that is used as part of the regular sync

It seems certain combinations of cache invalidation and request mean every user is included in the cached list call, which in turn blocks all sync requests on that instance until it clears.

The queries are taking minutes to run (and the database is not at max throughput), example:

SELECT c.state_key, room_id, e.instance_name, e.stream_ordering
                FROM current_state_events AS c
                INNER JOIN events AS e USING (room_id, event_id)
                WHERE
                    c.type = 'm.room.member'
                    AND c.membership = 'join'
                    AND c.state_key = ANY(ARRAY['... huge array of thousands of user IDs ...'])
(1 row)

Some thoughts:

  • should calls to @cached block if there's a relevant @cachedList call ongoing?
  • if so, how does the critical sync path not be blocked like above
  • should the calls to get_rooms_for_users be batched, currently unbounded?
  • should we reject sync requests that are super old and force a re-init?

Related: #14037

@squahtx squahtx added A-Sync defects related to /sync S-Major Major functionality / product severely impaired, no satisfactory workaround. O-Occasional Affects or can be seen by some users regularly or most users rarely T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. labels Oct 5, 2022
@DMRobertson
Copy link
Contributor

should the calls to get_rooms_for_users be batched, currently unbounded?

Very probably --- sounds like we need to use batch_iter/chunk_seq or similar to break this up into smaller chunks.

@DMRobertson
Copy link
Contributor

Out of interest, are you able to retrieve the query plan for one of the large queries with thousands of IDs?

@Fizzadar
Copy link
Contributor Author

Fizzadar commented Oct 8, 2022

Yes!: Since bringing in a466164 this is now reduced to a single index lookup since the join is gone, but still makes sense to use batch_iter I think.

Gather  (cost=1479.00..43813.18 rows=1 width=94)
  Workers Planned: 2
  ->  Nested Loop  (cost=479.00..42813.08 rows=1 width=94)
        ->  Parallel Bitmap Heap Scan on current_state_events c  (cost=478.30..27242.74 rows=4007 width=105)
"              Recheck Cond: ((state_key = ANY ('{... all the user ids ...}'::text[])) AND (type = 'm.room.member'::text))"
"              Filter: (membership = 'join'::text)"
              ->  Bitmap Index Scan on current_state_events_member_index  (cost=0.00..475.90 rows=17198 width=0)
"                    Index Cond: (state_key = ANY ('{... all the user IDs ...}'::text[]))"
        ->  Index Scan using events_event_id_key on events e  (cost=0.70..3.88 rows=1 width=108)
              Index Cond: (event_id = c.event_id)
              Filter: (c.room_id = room_id)

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A-Sync defects related to /sync O-Occasional Affects or can be seen by some users regularly or most users rarely S-Major Major functionality / product severely impaired, no satisfactory workaround. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants