You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
An enhancement of: #12994 (worker-mode support for Faster Remote Room Joins).
Instead of relying on the master to perform the re-syncing of the rooms, we should allow other workers to be involved.
Part of the difficulty is in choosing a worker to perform the re-sync for a room, ensuring that even after a crash/restart, exactly one worker will pick up the job of re-syncing that room again.
We should be mindful that in a hypothetical deployment, workers can be taken out of service — a room shouldn't be locked to one worker forever in case this happens, as that would mean the re-sync would never progress.
Aside: in future we should consider moving the /send_join request out of the master process. The obvious candidate is the "client reader" that receives the client-side /join request (and hence currently makes the request to ReplicationRemoteJoinRestServlet). The main thing to worry about then is locking (to ensure that we don't have multiple workers all trying to do the remote-join dance at once). For prior art in that department, we should look at the code that handles incoming events received over federation (https://github.com/matrix-org/synapse/blob/v1.69.0rc2/synapse/federation/federation_server.py#L1108-L1116), which uses a database row to hold a lock: we can simply call try_acquire_lock before starting a resync operation.
That still leaves us with the problem of making sure we resume the partial-state resync if the client reader that is currently processing it gets restarted (or, worse, turned off, never to return). Again following the example of incoming events: in that case, we kick off a processing job as soon as a worker discovers itself to be a "federation inbound" worker by receiving a /send request. Probably we could do the same here on a /_matrix/client/v3/rooms/.*/(send|join|invite|leave|ban|unban|kick) request?
— matrix-org/synapse#12994 (comment)
The text was updated successfully, but these errors were encountered:
This issue has been migrated from #14544.
An enhancement of: #12994 (worker-mode support for Faster Remote Room Joins).
Instead of relying on the master to perform the re-syncing of the rooms, we should allow other workers to be involved.
Part of the difficulty is in choosing a worker to perform the re-sync for a room, ensuring that even after a crash/restart, exactly one worker will pick up the job of re-syncing that room again.
We should be mindful that in a hypothetical deployment, workers can be taken out of service — a room shouldn't be locked to one worker forever in case this happens, as that would mean the re-sync would never progress.
The text was updated successfully, but these errors were encountered: