
_room_pdu_linearizer starves other incoming transactions #9490

Closed
richvdh opened this issue Feb 24, 2021 · 1 comment · Fixed by #10272

richvdh (Member) commented Feb 24, 2021

I had two transactions arrive at about the same time, each containing one PDU for the same room, each of which had a prev_events ref to a single unknown event.

The first transaction required a significant backfill/state resolution operation, requiring over 90 minutes of activity.

The second transaction waited patiently in the _room_pdu_linearizer for the first one to finish, and once it finally got to the front of the queue, was processed within 20 seconds. So I'm now 90 minutes behind in that room, and once you get behind it's very hard to catch up.

My questions are:

  • Can we somehow round-robin between transactions for a room, rather than having a single transaction hold the lock for 90 minutes?
  • Is this lock even required? If I understand correctly, when you have multiple federation_inbound workers, we can end up with each worker doing its own backfill here.
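
For reference, the pattern at issue looks roughly like this (an illustrative asyncio sketch with made-up names, not the real _room_pdu_linearizer code): every inbound transaction for a room queues behind a single per-room lock, so one slow holder delays everything behind it.

```python
import asyncio
from collections import defaultdict


class PerRoomLinearizer:
    """Toy stand-in for the per-room lock; not Synapse's Linearizer."""

    def __init__(self) -> None:
        self._locks: dict[str, asyncio.Lock] = defaultdict(asyncio.Lock)

    def queue(self, room_id: str) -> asyncio.Lock:
        # All callers for the same room share one lock, so a slow holder
        # delays every later transaction for that room.
        return self._locks[room_id]


async def handle_pdu(lin: PerRoomLinearizer, room_id: str, label: str, secs: float) -> None:
    async with lin.queue(room_id):
        # A 90-minute backfill/state-resolution here would block every other
        # inbound transaction for this room behind it.
        await asyncio.sleep(secs)
        print(f"{label} done after holding the room lock for {secs}s")


async def main() -> None:
    lin = PerRoomLinearizer()
    await asyncio.gather(
        handle_pdu(lin, "!room:example.org", "slow transaction", 3.0),
        handle_pdu(lin, "!room:example.org", "fast transaction", 0.1),
    )


asyncio.run(main())
```
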
clokep added the T-Enhancement label Feb 24, 2021
erikjohnston (Member) commented:

I think the linearizer is there to try and protect against lots of redundant work: in particular, if we are busy processing an event A and then get another event B that references A, we don't want to go and effectively fetch A and start processing it a second time concurrently. Synapse should handle it all correctly regardless, since when we come to persist the events the event persister queues will Do The Right Thing with any duplicate data.

Some other thoughts:

  • When we implemented running multiple federation inbound workers, I think we deliberately decided that duplicating work that is happening on other workers isn't the end of the world; at least we have still limited the total amount of duplicate work that can happen at any one time. (Plus cross-worker locks are something that sounds hard.)
  • We could refine the lock so that we can process multiple events in a room at a time, so long as there aren't any dependencies: i.e. if we are processing event A then any events that come in that reference A should block waiting for it to be processed, but events that don't reference in-flight/queued-up events can be processed immediately (see the sketch after this list). We probably do want to limit the amount of in-flight work somehow, as otherwise we run the risk of a broken room slowly consuming all resources on the worker, starving the other rooms from getting events.
  • 😱 90 minutes to process an event
  • Blocking the response to a transaction isn't ideal if we are spending so long processing events. Ideally we'd write such events to a staging area and return immediately; we could do this either for all events or just for events that require outbound HTTP calls (i.e. those with missing prev events, etc.). The problem here is that we'd have multiple workers pushing to and pulling from the queue, which will require some form of coordination between the workers.
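
On the dependency-aware lock idea above, something like the following is what I have in mind (a rough sketch with hypothetical names, not real Synapse code): incoming events wait only on in-flight events they actually reference, with a semaphore to bound the total in-flight work per worker.

```python
import asyncio


class DependencyGate:
    """Serialise only events that depend on work that is already in flight."""

    def __init__(self, max_concurrency: int = 5) -> None:
        # event_id -> future resolved once that event has been processed
        self._in_flight: dict[str, asyncio.Future] = {}
        # Cap concurrent work so one broken room can't eat the whole worker.
        self._sem = asyncio.Semaphore(max_concurrency)

    async def process(self, event_id: str, prev_event_ids: list[str], handler) -> None:
        # Block only if one of our prev_events is still being processed;
        # unrelated events go straight through.
        deps = [f for e in prev_event_ids if (f := self._in_flight.get(e))]
        if deps:
            await asyncio.gather(*deps)

        done: asyncio.Future = asyncio.get_running_loop().create_future()
        self._in_flight[event_id] = done
        try:
            async with self._sem:
                # `handler` stands in for the expensive missing-events /
                # state-resolution work.
                await handler(event_id)
        finally:
            done.set_result(None)
            self._in_flight.pop(event_id, None)
```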

To expand a bit on queuing up events in a staging area:

  • This allows batching up work: if we get a bunch of events from various servers then we can be more intelligent about how we process them (especially when other servers are sending us random events in "catch up" mode)
  • Persisting to a staging area means we can respond more quickly, and thus not block more data from being sent while we do complicated stuff
  • If we only queue up stuff that requires additional work, then we don't have to worry so much about introducing latency.
  • Plus, since it's not the end of the world if we do process the events twice, we can do some naive in-DB locking to coordinate the work between workers (e.g. have a table with a "this instance has a lock on room X, which will expire in N seconds/minutes" row; sketched below)
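
A minimal sketch of that naive in-DB lock (hypothetical room_locks table and function names; SQLite is used here only to keep the example self-contained, the real thing would live in our own schema):

```python
import sqlite3
import time
import uuid

SCHEMA = """
CREATE TABLE IF NOT EXISTS room_locks (
    room_id    TEXT PRIMARY KEY,
    instance   TEXT NOT NULL,
    expires_ts INTEGER NOT NULL  -- ms since epoch; a stale lock can be stolen
);
"""

LOCK_TTL_MS = 60_000  # "will expire in N seconds/minutes"


def try_acquire_room_lock(conn: sqlite3.Connection, room_id: str, instance: str) -> bool:
    """Take the lock on room_id if it is free or expired; True on success."""
    now_ms = int(time.time() * 1000)
    with conn:  # one transaction
        conn.execute(
            """
            INSERT INTO room_locks (room_id, instance, expires_ts)
            VALUES (?, ?, ?)
            ON CONFLICT (room_id) DO UPDATE
                SET instance = excluded.instance,
                    expires_ts = excluded.expires_ts
                WHERE expires_ts < ?
            """,
            (room_id, instance, now_ms + LOCK_TTL_MS, now_ms),
        )
    # Did we end up owning a live lock?
    row = conn.execute(
        "SELECT instance FROM room_locks WHERE room_id = ?", (room_id,)
    ).fetchone()
    return row is not None and row[0] == instance


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
worker = f"federation_inbound-{uuid.uuid4().hex[:8]}"
if try_acquire_room_lock(conn, "!room:example.org", worker):
    print(f"{worker} may drain the staging area for this room")
```

If a lock expires mid-way and another worker takes over, the resulting duplicate processing is tolerable for the reasons above.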

aaronraimist added a commit to aaronraimist/synapse that referenced this issue Jul 1, 2021
Synapse 1.37.1 (2021-06-30)
===========================

This release resolves issues (such as [matrix-org#9490](matrix-org#9490)) where one busy room could cause head-of-line blocking, starving Synapse from processing events in other rooms, and causing all federated traffic to fall behind. Synapse 1.37.1 processes inbound federation traffic asynchronously, ensuring that one busy room won't impact others. Please upgrade to Synapse 1.37.1 as soon as possible, in order to increase resilience to other traffic spikes.

No significant changes since v1.37.1rc1.

Synapse 1.37.1rc1 (2021-06-29)
==============================

Features
--------

- Handle inbound events from federation asynchronously. ([matrix-org#10269](matrix-org#10269), [matrix-org#10272](matrix-org#10272))
netbsd-srcmastr pushed a commit to NetBSD/pkgsrc that referenced this issue Jul 1, 2021
babolivier added a commit to matrix-org/synapse-dinsic that referenced this issue Sep 1, 2021
Fizzadar pushed a commit to Fizzadar/synapse that referenced this issue Oct 26, 2021