
Federation catchup stalls due to "concurrent access" exception thrown in the process_event_queue task #9635

Closed
ShadowJonathan opened this issue Mar 16, 2021 · 11 comments · Fixed by #9639
Labels: S-Minor (Blocks non-critical functionality, workarounds exist), T-Defect (Bugs, crashes, hangs, security vulnerabilities, or other reported issues)

Comments

ShadowJonathan (Contributor) commented Mar 16, 2021

sender1_1        | 2021-03-16 20:00:14,270 - synapse.storage.txn - 538 - WARNING - process_event_queue_for_federation-19 - [TXN OPERROR] {store_destination_rooms_entries-9e4f} could not serialize access due to concurrent update
sender1_1        | 2021-03-16 20:00:14,335 - synapse.storage.txn - 538 - WARNING - process_event_queue_for_federation-19 - [TXN OPERROR] {store_destination_rooms_entries-9e4f} could not serialize access due to concurrent update
sender1_1        | 2021-03-16 20:00:14,380 - synapse.storage.txn - 538 - WARNING - process_event_queue_for_federation-19 - [TXN OPERROR] {store_destination_rooms_entries-9e4f} could not serialize access due to concurrent update
sender1_1        | 2021-03-16 20:00:14,414 - synapse.storage.txn - 538 - WARNING - process_event_queue_for_federation-19 - [TXN OPERROR] {store_destination_rooms_entries-9e4f} could not serialize access due to concurrent update
sender1_1        | 2021-03-16 20:00:14,451 - synapse.storage.txn - 538 - WARNING - process_event_queue_for_federation-19 - [TXN OPERROR] {store_destination_rooms_entries-9e4f} could not serialize access due to concurrent update
sender1_1        | 2021-03-16 20:00:14,510 - synapse.storage.txn - 538 - WARNING - process_event_queue_for_federation-19 - [TXN OPERROR] {store_destination_rooms_entries-9e4f} could not serialize access due to concurrent update
sender1_1        | 2021-03-16 20:00:14,771 - synapse.metrics.background_process_metrics - 210 - ERROR - process_event_queue_for_federation-19 - Background process 'process_event_queue_for_federation' threw an exception
sender1_1        | 2021-03-16 20:00:29,184 - synapse.federation.sender - 187 - DEBUG - process_event_queue_for_federation-20 - Handling 131486 -> 131586

My federation sender is stuck in a simple loop: it tries to catch up with events by doing handle_event over batches of 100 events; each event goes through _send_pdu, which calls store_destination_rooms_entries, which throws "could not serialize access due to concurrent update". The process_event_queue_for_federation task dies, its finally: sets self._is_processing = False, and another process-event-queue task can then be started by notify_new_events (which starts from scratch).
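For context, the control flow is roughly this (heavily simplified sketch, not the exact source; the method names are the real ones from synapse/federation/sender/__init__.py):

from synapse.metrics.background_process_metrics import run_as_background_process

class FederationSender:
    # ... (only the relevant pieces shown)

    def notify_new_events(self, max_token):
        # Called whenever new events are ready to be sent out.
        if self._is_processing:
            return
        # Kick off a fresh catchup task in the background.
        run_as_background_process(
            "process_event_queue_for_federation", self._process_event_queue_loop
        )

    async def _process_event_queue_loop(self):
        try:
            self._is_processing = True
            while True:
                # Fetch the next batch of up to 100 events, then per event:
                # handle_event -> _send_pdu -> store_destination_rooms_entries
                ...
        finally:
            # If the store call raises, the task dies here, and the next
            # notify_new_events starts the whole loop from scratch.
            self._is_processing = False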

Here's the error:

sender1_1        | 2021-03-16 20:02:25,657 - synapse.metrics.background_process_metrics - 210 - ERROR - process_event_queue_for_federation-24 - Background process 'process_event_queue_for_federation' threw an exception
sender1_1        | Traceback (most recent call last):
sender1_1        |   File "/usr/local/lib/python3.8/site-packages/synapse/metrics/background_process_metrics.py", line 208, in run
sender1_1        |     return await maybe_awaitable(func(*args, **kwargs))
sender1_1        |   File "/usr/local/lib/python3.8/site-packages/synapse/federation/sender/__init__.py", line 272, in _process_event_queue_loop
sender1_1        |     await make_deferred_yieldable(
sender1_1        | twisted.internet.defer.FirstError: FirstError[#0, [Failure instance: Traceback: <class 'psycopg2.errors.SerializationFailure'>: could not serialize access due to concurrent update
sender1_1        | 
sender1_1        | /usr/local/lib/python3.8/site-packages/twisted/internet/defer.py:517:errback
sender1_1        | /usr/local/lib/python3.8/site-packages/twisted/internet/defer.py:580:_startRunCallbacks
sender1_1        | /usr/local/lib/python3.8/site-packages/twisted/internet/defer.py:662:_runCallbacks
sender1_1        | /usr/local/lib/python3.8/site-packages/twisted/internet/defer.py:1514:gotResult
sender1_1        | --- <exception caught here> ---
sender1_1        | /usr/local/lib/python3.8/site-packages/twisted/internet/defer.py:1443:_inlineCallbacks
sender1_1        | /usr/local/lib/python3.8/site-packages/twisted/python/failure.py:500:throwExceptionIntoGenerator
sender1_1        | /usr/local/lib/python3.8/site-packages/synapse/federation/sender/__init__.py:266:handle_room_events
sender1_1        | /usr/local/lib/python3.8/site-packages/synapse/federation/sender/__init__.py:254:handle_event
sender1_1        | /usr/local/lib/python3.8/site-packages/synapse/federation/sender/__init__.py:330:_send_pdu
sender1_1        | /usr/local/lib/python3.8/site-packages/synapse/storage/databases/main/transactions.py:315:store_destination_rooms_entries
sender1_1        | /usr/local/lib/python3.8/site-packages/synapse/storage/database.py:660:runInteraction
sender1_1        | /usr/local/lib/python3.8/site-packages/synapse/storage/database.py:743:runWithConnection
sender1_1        | /usr/local/lib/python3.8/site-packages/twisted/python/threadpool.py:238:inContext
sender1_1        | /usr/local/lib/python3.8/site-packages/twisted/python/threadpool.py:254:<lambda>
sender1_1        | /usr/local/lib/python3.8/site-packages/twisted/python/context.py:118:callWithContext
sender1_1        | /usr/local/lib/python3.8/site-packages/twisted/python/context.py:83:callWithContext
sender1_1        | /usr/local/lib/python3.8/site-packages/twisted/enterprise/adbapi.py:293:_runWithConnection
sender1_1        | /usr/local/lib/python3.8/site-packages/twisted/python/deprecate.py:298:deprecatedFunction
sender1_1        | /usr/local/lib/python3.8/site-packages/twisted/python/compat.py:403:reraise
sender1_1        | /usr/local/lib/python3.8/site-packages/twisted/enterprise/adbapi.py:284:_runWithConnection
sender1_1        | /usr/local/lib/python3.8/site-packages/synapse/storage/database.py:738:inner_func
sender1_1        | /usr/local/lib/python3.8/site-packages/synapse/storage/database.py:532:new_transaction
sender1_1        | /usr/local/lib/python3.8/site-packages/synapse/storage/databases/main/transactions.py:347:_store_destination_rooms_entries_txn
sender1_1        | /usr/local/lib/python3.8/site-packages/synapse/storage/database.py:275:execute_batch
sender1_1        | /usr/local/lib/python3.8/site-packages/synapse/storage/database.py:320:_do_execute
sender1_1        | /usr/local/lib/python3.8/site-packages/synapse/storage/database.py:275:<lambda>
sender1_1        | /usr/local/lib/python3.8/site-packages/psycopg2/extras.py:1209:execute_batch
sender1_1        | ]]

This causes my federation sender to stall: the transmission loops keep sending events and (probably) updating the destination_rooms table, while the process_event_queue_for_federation task keeps crashing, restarting, and looping over the same batch.

ShadowJonathan changed the title to Federation catchup stalls due to "concurrent access" exception thrown in the process_event_queue task on Mar 16, 2021
ShadowJonathan (Contributor, Author) commented Mar 16, 2021

Also important to note: at some point I had multiple(?) process_event_queue_for_federation tasks running (log messages with different task numbers were intertwined).

That might be a bug in itself; I'm trying to figure out what could be the cause. It's likely a race condition between _process_event_queue_loop actually kicking off and notify_new_events getting a new event in.

ShadowJonathan (Contributor, Author) commented:

I investigated a bit, and it seems the problem above isn't what's causing this issue in the first place, but it's still notable.

ShadowJonathan (Contributor, Author) commented:

I've looked around a bit; this is the offending snippet:

await make_deferred_yieldable(
    defer.gatherResults(
        [
            run_in_background(handle_room_events, evs)
            for evs in events_by_room.values()
        ],
        consumeErrors=True,
    )
)

  • this snippet throws a handle_room_events task for every room into the background and waits for them all to finish
  • each handle_room_events call invokes handle_event multiple times
  • which calls _send_pdu
  • which calls self.store.store_destination_rooms_entries

So you end up with concurrent self.store.store_destination_rooms_entries calls for an overlapping set of destinations. This makes postgres refuse to serialize the access and throw an exception into one of the tasks, which bubbles up to process_event_queue_for_federation, which then exits.
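For reference, the underlying postgres behaviour can be reproduced outside synapse with two plain connections (standalone sketch: the DSN and the demo table are made up, but the pattern of two transactions updating the same row is exactly what the traceback shows):

import psycopg2
from psycopg2 import errors
from psycopg2.extensions import ISOLATION_LEVEL_REPEATABLE_READ

dsn = "dbname=test"  # hypothetical DSN, adjust for your setup

# Set up a throwaway table with one destination/room row.
setup = psycopg2.connect(dsn)
with setup, setup.cursor() as cur:
    cur.execute(
        "CREATE TABLE IF NOT EXISTS destination_rooms_demo "
        "(destination TEXT, room_id TEXT, stream_ordering BIGINT)"
    )
    cur.execute(
        "INSERT INTO destination_rooms_demo VALUES ('example.org', '!a:example.org', 1)"
    )
setup.close()

conn1 = psycopg2.connect(dsn)
conn2 = psycopg2.connect(dsn)
conn1.set_isolation_level(ISOLATION_LEVEL_REPEATABLE_READ)
conn2.set_isolation_level(ISOLATION_LEVEL_REPEATABLE_READ)
cur1, cur2 = conn1.cursor(), conn2.cursor()

# Both transactions take their snapshots.
cur1.execute("SELECT 1")
cur2.execute("SELECT 1")

# Transaction 1 bumps the row for this destination/room and commits.
cur1.execute(
    "UPDATE destination_rooms_demo SET stream_ordering = 2 "
    "WHERE destination = 'example.org' AND room_id = '!a:example.org'"
)
conn1.commit()

# Transaction 2, whose snapshot predates that commit, touches the same row:
# postgres raises SerializationFailure ("could not serialize access due to
# concurrent update"), the same error as in the log above.
try:
    cur2.execute(
        "UPDATE destination_rooms_demo SET stream_ordering = 3 "
        "WHERE destination = 'example.org' AND room_id = '!a:example.org'"
    )
except errors.SerializationFailure as e:
    print("reproduced:", e)
    conn2.rollback()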

(I suspect this is also how the "old" process_event_queue_for_federation's number continues, because the other task still has this logging context)

ShadowJonathan (Contributor, Author) commented Mar 16, 2021

The effect of this is that it stalls (stops) federation when events are sent in multiple rooms to the same servers. The backlog keeps piling up as more and more events are sent in multiple rooms by multiple users, and eventually federation sending stops working entirely, as the crash-and-restart loop keeps triggering the same data race over and over.

ShadowJonathan (Contributor, Author) commented Mar 17, 2021

So far I've identified two issues here:

  • When chatting in large rooms, self.store.store_destination_rooms_entries might take a while, which slows down the processing queue and makes it likely to hit this problem when it then suddenly handles a batch of new events from different rooms (going to the same servers). This can be fixed by making sure handle_room_events only "flushes" the destination_rooms table with its entries on the last event it sends.
    • (Flushing intermittently adds nothing and only puts more load on the database. In the worst case the federation sender crashes entirely in the middle of processing that batch, but since it does not write a stream_id position until handle_room_events has finished anyway, the writes to destination_rooms can be compressed down to the last event.)
  • self.store.store_destination_rooms_entries needs to be made concurrent-proof, so that two coroutines with an overlapping destinations set don't get an exception thrown into either of them when postgres "cannot serialise access"; see the sketch below.
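To illustrate the second point, here is a rough sketch of one possible approach (this is not the actual fix, just an illustration: retry the write when postgres reports a serialization failure rather than letting the exception kill the whole task; the retry_on_serialization_failure helper, its signature, and the backoff numbers are made up, and plain asyncio.sleep stands in for synapse's clock):

import asyncio
from typing import Awaitable, Callable, TypeVar

from psycopg2 import errors

T = TypeVar("T")

async def retry_on_serialization_failure(
    func: Callable[[], Awaitable[T]],
    max_attempts: int = 5,
    base_delay: float = 0.1,
) -> T:
    # Serialization failures are transient: re-running the transaction after
    # the conflicting transaction has committed will normally succeed.
    for attempt in range(1, max_attempts + 1):
        try:
            return await func()
        except errors.SerializationFailure:
            if attempt == max_attempts:
                raise
            # Linear backoff gives the conflicting transaction time to commit.
            await asyncio.sleep(base_delay * attempt)

# Hypothetical call site in _send_pdu:
#     await retry_on_serialization_failure(
#         lambda: self.store.store_destination_rooms_entries(destinations, room_id, order)
#     )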

Legogris (Contributor) commented:

Thanks for digging into this - I am affected by this as well

erikjohnston added the S-Minor and T-Defect labels on Mar 22, 2021
heftig commented Mar 26, 2021

We got hit by this hard today. I patched synapse with #9639 here but can we have a 1.30.2 for this, please?

erikjohnston (Member) commented:

An RC is due out early next week

ShadowJonathan (Contributor, Author) commented Mar 26, 2021

> We got hit by this hard today. I patched synapse with #9639 here but can we have a 1.30.2 for this, please?

@heftig Was this the only time, or did this happen earlier? I'd be surprised if this has only been an issue "now".

heftig commented Mar 26, 2021

@ShadowJonathan I don't remember if it was an issue before, sorry. Looking at the monitoring, there was a spike of CPU usage when the federation broke (about two hours ago) that ended shortly after I applied the fix (about one hour ago).

ShadowJonathan (Contributor, Author) commented Mar 26, 2021

The patch most likely only restarted the federation sender process, letting it push through the one moment where it could send the troublesome events, but the underlying issue isn't fixed; it'll be properly fixed with the next 1.31 RC or release 👍
