Federation catchup stalls due to "concurrent access" exception thrown in the process_event_queue task #9635
Comments
Also important to note is that at some point I got multiple(?) That might be a bug, and I'm trying to figure out what could be the cause; it's likely a race condition between
I investigated a bit, and the problem above doesn't seem to be what causes this issue in the first place, but it's still notable.
I've looked around a bit, and this is the offending snippet: synapse/synapse/federation/sender/__init__.py, lines 280 to 288 in ec621f9
So you end up with concurrent (I suspect this is also how the "old"
The effect is that federation stalls when events are sent in multiple rooms to the same servers. The backlog piles up as more and more events are sent in multiple rooms by multiple users, and can eventually make federation sending stop working entirely, as the crash-and-restart loop keeps triggering the same data race over and over.
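To illustrate the kind of conflict described here, below is a minimal sketch, not Synapse code. `ToyTable`, `ConcurrentUpdate`, and the version check are hypothetical stand-ins for a database that rejects a write when the row changed after the transaction first read it. Two rooms writing an entry for the same destination at the same time is enough to make one of the writes fail.

```python
import asyncio


class ConcurrentUpdate(Exception):
    """Hypothetical stand-in for the database's serialization failure."""


class ToyTable:
    """Toy table keyed by destination. This is an assumption, not Synapse's schema."""

    def __init__(self):
        self.rows = {}  # destination -> (version, room_id)

    async def upsert(self, destination, room_id):
        # Read the row version at the start of the "transaction".
        version, _ = self.rows.get(destination, (0, None))
        # Yield to the event loop so another writer can run in between.
        await asyncio.sleep(0)
        # Refuse to write if someone else updated the row in the meantime.
        current, _ = self.rows.get(destination, (0, None))
        if current != version:
            raise ConcurrentUpdate(
                "could not serialize access due to concurrent update"
            )
        self.rows[destination] = (version + 1, room_id)


async def main():
    table = ToyTable()
    # Two rooms send to the same (made-up) destination concurrently.
    return await asyncio.gather(
        table.upsert("example.org", "!room_a"),
        table.upsert("example.org", "!room_b"),
        return_exceptions=True,
    )


results = asyncio.run(main())
failures = [r for r in results if isinstance(r, ConcurrentUpdate)]
print(f"{len(failures)} of {len(results)} writes failed")
```

In this toy, the more writers target the same destination at once, the more of them fail, which matches the pile-up described above.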
So far I've identified 2 issues here:
Thanks for digging into this - I am affected by this as well
We got hit by this hard today. I patched Synapse with #9639 here, but can we have a 1.30.2 for this, please?
An RC is due out early next week
@ShadowJonathan I don't remember if it was an issue before, sorry. Looking at the monitoring, there was a spike of CPU usage when the federation broke (about two hours ago) that ended shortly after I applied the fix (about one hour ago).
The patch most likely only restarted the federation sender process, letting it push through the one moment where it could send the troublesome events. It isn't actually fixed yet; it'll be properly fixed with the next 1.31 RC or release 👍
My federation sender is in a simple loop: it tries to catch up with events, which it does by doing `handle_event` over batches of 100 events, which calls `_send_pdu` on every event, which calls `store_destination_rooms_entries`, which throws an exception due to `could not serialize access due to concurrent update`. The `process_event_queue_for_federation` task dies, its `finally:` does `self._is_processing = False`, and then another process event queue task can be started in `notify_new_events` (which starts from scratch).

Here's the error:

This causes my federation sender to "stall", with transmission loops constantly sending events while (probably) updating the `destination_room` table, which causes the `process_event_queue_for_federation` task to "stall" and start to loop.
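The loop described above can be modelled with a small toy. This is a sketch under assumptions, not the actual Synapse implementation: `ToySender`, `SerializationError`, `last_token`, and `fail_on` are made-up names, and the real methods take different arguments. It only shows why a task that dies mid-batch, resets its flag in `finally:`, and restarts from the same position never makes progress.

```python
import asyncio


class SerializationError(Exception):
    """Hypothetical stand-in for the database's serialization failure."""


class ToySender:
    def __init__(self, events, fail_on):
        self.events = events      # pending events, oldest first
        self.fail_on = fail_on    # event whose store call always conflicts
        self.last_token = 0       # saved position; only advances after a full batch
        self._is_processing = False
        self.runs = 0             # how many times the task was started

    async def store_destination_rooms_entries(self, event):
        if event == self.fail_on:
            raise SerializationError(
                "could not serialize access due to concurrent update"
            )

    async def process_event_queue(self):
        if self._is_processing:
            return
        self._is_processing = True
        self.runs += 1
        try:
            batch = self.events[self.last_token:self.last_token + 100]
            for event in batch:
                await self.store_destination_rooms_entries(event)
            # Never reached if any event in the batch raised.
            self.last_token += len(batch)
        finally:
            # The flag is cleared even on failure, so a new task may start.
            self._is_processing = False

    async def notify_new_events(self):
        try:
            await self.process_event_queue()
        except SerializationError:
            pass  # the task died; the next notification starts from scratch


async def main():
    sender = ToySender(events=list(range(10)), fail_on=7)
    for _ in range(3):
        await sender.notify_new_events()
    return sender.runs, sender.last_token


runs, last_token = asyncio.run(main())
print(f"task started {runs} times, position is still {last_token}")
```

In this toy every restart replays the same batch and hits the same failing event, so the position never moves. In the real system the conflict depends on timing, which is presumably why a restart sometimes gets through.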