
Federation catchup stalls due to "concurrent access" exception thrown in the process_event_queue task #9635

Closed
ShadowJonathan opened this issue Mar 16, 2021 · 11 comments · Fixed by #9639
Labels: S-Minor (Blocks non-critical functionality, workarounds exist), T-Defect (Bugs, crashes, hangs, security vulnerabilities, or other reported issues)

Comments

ShadowJonathan (Contributor) commented Mar 16, 2021

sender1_1        | 2021-03-16 20:00:14,270 - synapse.storage.txn - 538 - WARNING - process_event_queue_for_federation-19 - [TXN OPERROR] {store_destination_rooms_entries-9e4f} could not serialize access due to concurrent update
sender1_1        | 2021-03-16 20:00:14,335 - synapse.storage.txn - 538 - WARNING - process_event_queue_for_federation-19 - [TXN OPERROR] {store_destination_rooms_entries-9e4f} could not serialize access due to concurrent update
sender1_1        | 2021-03-16 20:00:14,380 - synapse.storage.txn - 538 - WARNING - process_event_queue_for_federation-19 - [TXN OPERROR] {store_destination_rooms_entries-9e4f} could not serialize access due to concurrent update
sender1_1        | 2021-03-16 20:00:14,414 - synapse.storage.txn - 538 - WARNING - process_event_queue_for_federation-19 - [TXN OPERROR] {store_destination_rooms_entries-9e4f} could not serialize access due to concurrent update
sender1_1        | 2021-03-16 20:00:14,451 - synapse.storage.txn - 538 - WARNING - process_event_queue_for_federation-19 - [TXN OPERROR] {store_destination_rooms_entries-9e4f} could not serialize access due to concurrent update
sender1_1        | 2021-03-16 20:00:14,510 - synapse.storage.txn - 538 - WARNING - process_event_queue_for_federation-19 - [TXN OPERROR] {store_destination_rooms_entries-9e4f} could not serialize access due to concurrent update
sender1_1        | 2021-03-16 20:00:14,771 - synapse.metrics.background_process_metrics - 210 - ERROR - process_event_queue_for_federation-19 - Background process 'process_event_queue_for_federation' threw an exception
sender1_1        | 2021-03-16 20:00:29,184 - synapse.federation.sender - 187 - DEBUG - process_event_queue_for_federation-20 - Handling 131486 -> 131586

My federation sender is stuck in a simple loop: it tries to catch up with events by doing handle_event over batches of 100 events; each event goes through _send_pdu, which calls store_destination_rooms_entries, which throws "could not serialize access due to concurrent update". The process_event_queue_for_federation task dies, its finally: sets self._is_processing = False, and another process-event-queue task can then be started by notify_new_events (which starts from scratch).
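For context, the control flow is roughly this (heavily simplified sketch, not the exact source; the method names are the real ones from synapse/federation/sender/__init__.py):

from synapse.metrics.background_process_metrics import run_as_background_process

class FederationSender:
    # ... (only the relevant pieces shown)

    def notify_new_events(self, max_token):
        # Called whenever new events are ready to be sent out.
        if self._is_processing:
            return
        # Kick off a fresh catchup task in the background.
        run_as_background_process(
            "process_event_queue_for_federation", self._process_event_queue_loop
        )

    async def _process_event_queue_loop(self):
        try:
            self._is_processing = True
            while True:
                # Fetch the next batch of up to 100 events, then per event:
                # handle_event -> _send_pdu -> store_destination_rooms_entries
                ...
        finally:
            # If the store call raises, the task dies here, and the next
            # notify_new_events starts the whole loop from scratch.
            self._is_processing = False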

Here's the error:

sender1_1        | 2021-03-16 20:02:25,657 - synapse.metrics.background_process_metrics - 210 - ERROR - process_event_queue_for_federation-24 - Background process 'process_event_queue_for_federation' threw an exception
sender1_1        | Traceback (most recent call last):
sender1_1        |   File "/usr/local/lib/python3.8/site-packages/synapse/metrics/background_process_metrics.py", line 208, in run
sender1_1        |     return await maybe_awaitable(func(*args, **kwargs))
sender1_1        |   File "/usr/local/lib/python3.8/site-packages/synapse/federation/sender/__init__.py", line 272, in _process_event_queue_loop
sender1_1        |     await make_deferred_yieldable(
sender1_1        | twisted.internet.defer.FirstError: FirstError[#0, [Failure instance: Traceback: <class 'psycopg2.errors.SerializationFailure'>: could not serialize access due to concurrent update
sender1_1        | 
sender1_1        | /usr/local/lib/python3.8/site-packages/twisted/internet/defer.py:517:errback
sender1_1        | /usr/local/lib/python3.8/site-packages/twisted/internet/defer.py:580:_startRunCallbacks
sender1_1        | /usr/local/lib/python3.8/site-packages/twisted/internet/defer.py:662:_runCallbacks
sender1_1        | /usr/local/lib/python3.8/site-packages/twisted/internet/defer.py:1514:gotResult
sender1_1        | --- <exception caught here> ---
sender1_1        | /usr/local/lib/python3.8/site-packages/twisted/internet/defer.py:1443:_inlineCallbacks
sender1_1        | /usr/local/lib/python3.8/site-packages/twisted/python/failure.py:500:throwExceptionIntoGenerator
sender1_1        | /usr/local/lib/python3.8/site-packages/synapse/federation/sender/__init__.py:266:handle_room_events
sender1_1        | /usr/local/lib/python3.8/site-packages/synapse/federation/sender/__init__.py:254:handle_event
sender1_1        | /usr/local/lib/python3.8/site-packages/synapse/federation/sender/__init__.py:330:_send_pdu
sender1_1        | /usr/local/lib/python3.8/site-packages/synapse/storage/databases/main/transactions.py:315:store_destination_rooms_entries
sender1_1        | /usr/local/lib/python3.8/site-packages/synapse/storage/database.py:660:runInteraction
sender1_1        | /usr/local/lib/python3.8/site-packages/synapse/storage/database.py:743:runWithConnection
sender1_1        | /usr/local/lib/python3.8/site-packages/twisted/python/threadpool.py:238:inContext
sender1_1        | /usr/local/lib/python3.8/site-packages/twisted/python/threadpool.py:254:<lambda>
sender1_1        | /usr/local/lib/python3.8/site-packages/twisted/python/context.py:118:callWithContext
sender1_1        | /usr/local/lib/python3.8/site-packages/twisted/python/context.py:83:callWithContext
sender1_1        | /usr/local/lib/python3.8/site-packages/twisted/enterprise/adbapi.py:293:_runWithConnection
sender1_1        | /usr/local/lib/python3.8/site-packages/twisted/python/deprecate.py:298:deprecatedFunction
sender1_1        | /usr/local/lib/python3.8/site-packages/twisted/python/compat.py:403:reraise
sender1_1        | /usr/local/lib/python3.8/site-packages/twisted/enterprise/adbapi.py:284:_runWithConnection
sender1_1        | /usr/local/lib/python3.8/site-packages/synapse/storage/database.py:738:inner_func
sender1_1        | /usr/local/lib/python3.8/site-packages/synapse/storage/database.py:532:new_transaction
sender1_1        | /usr/local/lib/python3.8/site-packages/synapse/storage/databases/main/transactions.py:347:_store_destination_rooms_entries_txn
sender1_1        | /usr/local/lib/python3.8/site-packages/synapse/storage/database.py:275:execute_batch
sender1_1        | /usr/local/lib/python3.8/site-packages/synapse/storage/database.py:320:_do_execute
sender1_1        | /usr/local/lib/python3.8/site-packages/synapse/storage/database.py:275:<lambda>
sender1_1        | /usr/local/lib/python3.8/site-packages/psycopg2/extras.py:1209:execute_batch
sender1_1        | ]]

This causes my federation sender to stall: the transmission loops keep sending events and (probably) updating the destination_rooms table, while the process_event_queue_for_federation task keeps crashing, restarting, and looping over the same batch.

ShadowJonathan changed the title to Federation catchup stalls due to "concurrent access" exception thrown in the process_event_queue task on Mar 16, 2021
ShadowJonathan (Contributor, Author) commented Mar 16, 2021

Also important to note: at some point I had multiple(?) process_event_queue_for_federation tasks running (log messages with different task numbers were intertwined).

That might be a bug in itself; I'm trying to figure out what could be the cause. It's likely a race condition between _process_event_queue_loop actually kicking off and notify_new_events getting a new event in.

ShadowJonathan (Contributor, Author) commented:

I investigated a bit, and it seems the problem above isn't what's causing this issue in the first place, but it's still notable.

ShadowJonathan (Contributor, Author) commented:

I've looked around a bit; this is the offending snippet:

await make_deferred_yieldable(
    defer.gatherResults(
        [
            run_in_background(handle_room_events, evs)
            for evs in events_by_room.values()
        ],
        consumeErrors=True,
    )
)

  • this snippet throws a handle_room_events task for every room into the background and waits for them all to finish
  • each handle_room_events call invokes handle_event multiple times
  • which calls _send_pdu
  • which calls self.store.store_destination_rooms_entries

So you end up with concurrent self.store.store_destination_rooms_entries calls for an overlapping set of destinations. This makes postgres refuse to serialize the access and throw an exception into one of the tasks, which bubbles up to process_event_queue_for_federation, which then exits.
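For reference, the underlying postgres behaviour can be reproduced outside synapse with two plain connections (standalone sketch: the DSN and the demo table are made up, but the pattern of two transactions updating the same row is exactly what the traceback shows):

import psycopg2
from psycopg2 import errors
from psycopg2.extensions import ISOLATION_LEVEL_REPEATABLE_READ

dsn = "dbname=test"  # hypothetical DSN, adjust for your setup

# Set up a throwaway table with one destination/room row.
setup = psycopg2.connect(dsn)
with setup, setup.cursor() as cur:
    cur.execute(
        "CREATE TABLE IF NOT EXISTS destination_rooms_demo "
        "(destination TEXT, room_id TEXT, stream_ordering BIGINT)"
    )
    cur.execute(
        "INSERT INTO destination_rooms_demo VALUES ('example.org', '!a:example.org', 1)"
    )
setup.close()

conn1 = psycopg2.connect(dsn)
conn2 = psycopg2.connect(dsn)
conn1.set_isolation_level(ISOLATION_LEVEL_REPEATABLE_READ)
conn2.set_isolation_level(ISOLATION_LEVEL_REPEATABLE_READ)
cur1, cur2 = conn1.cursor(), conn2.cursor()

# Both transactions take their snapshots.
cur1.execute("SELECT 1")
cur2.execute("SELECT 1")

# Transaction 1 bumps the row for this destination/room and commits.
cur1.execute(
    "UPDATE destination_rooms_demo SET stream_ordering = 2 "
    "WHERE destination = 'example.org' AND room_id = '!a:example.org'"
)
conn1.commit()

# Transaction 2, whose snapshot predates that commit, touches the same row:
# postgres raises SerializationFailure ("could not serialize access due to
# concurrent update"), the same error as in the log above.
try:
    cur2.execute(
        "UPDATE destination_rooms_demo SET stream_ordering = 3 "
        "WHERE destination = 'example.org' AND room_id = '!a:example.org'"
    )
except errors.SerializationFailure as e:
    print("reproduced:", e)
    conn2.rollback()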

(I suspect this is also how the "old" process_event_queue_for_federation's number continues, because the other task still has this logging context)

ShadowJonathan (Contributor, Author) commented Mar 16, 2021

The effect of this is that it stalls (stops) federation when events are sent in multiple rooms to the same servers. The backlog keeps piling up as more and more events are sent in multiple rooms by multiple users, and eventually federation sending stops working entirely, as the crash-and-restart loop keeps triggering the same data race over and over.

ShadowJonathan (Contributor, Author) commented Mar 17, 2021

So far I've identified two issues here:

  • When chatting in large rooms, self.store.store_destination_rooms_entries might take a while, which slows down the processing queue and makes it likely to hit this problem when it then suddenly handles a batch of new events from different rooms (going to the same servers). This can be fixed by making sure handle_room_events only "flushes" the destination_rooms table with its entries on the last event it sends.
    • (Flushing intermittently adds nothing and only puts more load on the database. In the worst case the federation sender crashes entirely in the middle of processing that batch, but since it does not write a stream_id position until handle_room_events has finished anyway, the writes to destination_rooms can be compressed down to the last event.)
  • self.store.store_destination_rooms_entries needs to be made concurrent-proof, so that two coroutines with an overlapping destinations set don't get an exception thrown into either of them when postgres "cannot serialise access"; see the sketch below.
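To illustrate the second point, here is a rough sketch of one possible approach (this is not the actual fix, just an illustration: retry the write when postgres reports a serialization failure rather than letting the exception kill the whole task; the retry_on_serialization_failure helper, its signature, and the backoff numbers are made up, and plain asyncio.sleep stands in for synapse's clock):

import asyncio
from typing import Awaitable, Callable, TypeVar

from psycopg2 import errors

T = TypeVar("T")

async def retry_on_serialization_failure(
    func: Callable[[], Awaitable[T]],
    max_attempts: int = 5,
    base_delay: float = 0.1,
) -> T:
    # Serialization failures are transient: re-running the transaction after
    # the conflicting transaction has committed will normally succeed.
    for attempt in range(1, max_attempts + 1):
        try:
            return await func()
        except errors.SerializationFailure:
            if attempt == max_attempts:
                raise
            # Linear backoff gives the conflicting transaction time to commit.
            await asyncio.sleep(base_delay * attempt)

# Hypothetical call site in _send_pdu:
#     await retry_on_serialization_failure(
#         lambda: self.store.store_destination_rooms_entries(destinations, room_id, order)
#     )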

Legogris (Contributor) commented:

Thanks for digging into this - I am affected by this as well

erikjohnston added the S-Minor and T-Defect labels on Mar 22, 2021
heftig commented Mar 26, 2021

We got hit by this hard today. I patched synapse with #9639 here but can we have a 1.30.2 for this, please?

erikjohnston (Member) commented:

An RC is due out early next week

ShadowJonathan (Contributor, Author) commented Mar 26, 2021

> We got hit by this hard today. I patched synapse with #9639 here but can we have a 1.30.2 for this, please?

@heftig Was this the only time, or did this happen earlier? I'd be surprised if this has only been an issue "now".

heftig commented Mar 26, 2021

@ShadowJonathan I don't remember if it was an issue before, sorry. Looking at the monitoring, there was a spike of CPU usage when the federation broke (about two hours ago) that ended shortly after I applied the fix (about one hour ago).

ShadowJonathan (Contributor, Author) commented Mar 26, 2021

The patch most likely only restarted the federation sender process, letting it push through the one moment where it could send the troublesome events, but the underlying issue isn't fixed; it'll be properly fixed with the next 1.31 RC or release 👍
