Synapse drops received federated to-device messages if it cannot talk to worker processes #17117

Closed
kegsay opened this issue Apr 24, 2024 · 3 comments · Fixed by #17127

@kegsay (Contributor) commented Apr 24, 2024

Description

Context: Matrix.org had a bit of a meltdown a few days ago. This led to a series of unable-to-decrypts (UTDs) appearing for various users. As part of element-hq/element-meta#245 I have been looking into UTDs, so I looked at some rageshakes submitted by @ara4n.

In his case, he sent a to-device message from element.io to matrix.org users, which appears to have been delivered to matrix.org. However, the logs on matrix.org show:

2024-04-22 08:55:59,186 - synapse.federation.transport.server.federation - 112 - INFO - PUT-878465e2ae4a2c6b-FRA- - Received txn 1713706060091 from element.io. (PDUs: 0, EDUs: 1)
2024-04-22 08:55:59,491 - synapse.http.client - 945 - INFO - PUT-878465e2ae4a2c6b-FRA---- - Error sending request to  POST synapse-replication://encryption-1/_synapse/replication/fed_send_edu/m.direct_to_device/UVKnlAjoju: ConnectionRefusedError Connection refused
2024-04-22 08:55:59,491 - synapse.replication.http._base - 318 - WARNING - PUT-878465e2ae4a2c6b-FRA--- - fed_send_edu request connection failed; retrying in 1s: ConnectionRefusedError('Connection refused')
2024-04-22 08:56:00,794 - synapse.http.client - 945 - INFO - PUT-878465e2ae4a2c6b-FRA---- - Error sending request to  POST synapse-replication://encryption-1/_synapse/replication/fed_send_edu/m.direct_to_device/UVKnlAjoju: ConnectionRefusedError Connection refused
2024-04-22 08:56:00,795 - synapse.replication.http._base - 318 - WARNING - PUT-878465e2ae4a2c6b-FRA--- - fed_send_edu request connection failed; retrying in 2s: ConnectionRefusedError('Connection refused')
2024-04-22 08:56:03,101 - synapse.http.client - 945 - INFO - PUT-878465e2ae4a2c6b-FRA---- - Error sending request to  POST synapse-replication://encryption-1/_synapse/replication/fed_send_edu/m.direct_to_device/UVKnlAjoju: ConnectionRefusedError Connection refused
2024-04-22 08:56:03,101 - synapse.replication.http._base - 318 - WARNING - PUT-878465e2ae4a2c6b-FRA--- - fed_send_edu request connection failed; retrying in 4s: ConnectionRefusedError('Connection refused')
2024-04-22 08:56:07,405 - synapse.http.client - 945 - INFO - PUT-878465e2ae4a2c6b-FRA---- - Error sending request to  POST synapse-replication://encryption-1/_synapse/replication/fed_send_edu/m.direct_to_device/UVKnlAjoju: ConnectionRefusedError Connection refused
2024-04-22 08:56:07,405 - synapse.replication.http._base - 318 - WARNING - PUT-878465e2ae4a2c6b-FRA--- - fed_send_edu request connection failed; retrying in 8s: ConnectionRefusedError('Connection refused')
2024-04-22 08:56:15,709 - synapse.http.client - 945 - INFO - PUT-878465e2ae4a2c6b-FRA---- - Error sending request to  POST synapse-replication://encryption-1/_synapse/replication/fed_send_edu/m.direct_to_device/UVKnlAjoju: ConnectionRefusedError Connection refused
2024-04-22 08:56:15,709 - synapse.replication.http._base - 318 - WARNING - PUT-878465e2ae4a2c6b-FRA--- - fed_send_edu request connection failed; retrying in 16s: ConnectionRefusedError('Connection refused')
2024-04-22 08:56:32,015 - synapse.http.client - 945 - INFO - PUT-878465e2ae4a2c6b-FRA---- - Error sending request to  POST synapse-replication://encryption-1/_synapse/replication/fed_send_edu/m.direct_to_device/UVKnlAjoju: ConnectionRefusedError Connection refused
2024-04-22 08:56:32,015 - synapse.replication.http._base - 318 - WARNING - PUT-878465e2ae4a2c6b-FRA--- - fed_send_edu request connection failed; retrying in 32s: ConnectionRefusedError('Connection refused')
2024-04-22 08:57:04,396 - synapse.http.client - 945 - INFO - PUT-878465e2ae4a2c6b-FRA---- - Error sending request to  POST synapse-replication://encryption-1/_synapse/replication/fed_send_edu/m.direct_to_device/UVKnlAjoju: ConnectionRefusedError Connection refused
2024-04-22 08:57:04,396 - synapse.federation.federation_server - 1439 - INFO - PUT-878465e2ae4a2c6b-FRA-- - Failed to handle edu 'm.direct_to_device': SynapseError('502: Failed to talk to encryption-1 process')
2024-04-22 08:57:04,422 - synapse.access.http.15108 - 473 - INFO - PUT-878465e2ae4a2c6b-FRA - 3.74.23.23 - 15108 - {element.io} Processed request: 65.231sec/0.008sec (0.008sec, 0.003sec) (0.008sec/0.003sec/2) 11B 200 "PUT /_matrix/federation/v1/send/1713706060091 HTTP/1.1" "Synapse/1.105.0" [0 dbevts]

To ensure that to-device messages get delivered reliably, matrix.org must persist the EDU to disk before returning a 200 OK to element.io. Looking at these logs, the 200 OK is sent at 08:57:04,422, over a minute after receiving the request.
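
That minute lines up with the retry schedule visible in the replication client logs: six back-offs of 1, 2, 4, 8, 16 and 32 seconds (about 63 seconds) plus the individual connection attempts. A minimal sketch of that behaviour, using hypothetical names rather than Synapse's actual code:

```python
import asyncio


async def send_replication_request(make_request, max_attempts: int = 7):
    """Retry a replication call with exponential backoff, as implied by the
    log lines above: delays of 1, 2, 4, 8, 16 and 32 seconds (~63s in total)
    before the final attempt fails and the error surfaces to the EDU handler
    as "Failed to talk to encryption-1 process".

    `make_request` is a hypothetical coroutine performing one HTTP call to
    the worker; this is a sketch, not Synapse's implementation.
    """
    delay = 1
    for attempt in range(max_attempts):
        try:
            return await make_request()
        except ConnectionRefusedError as e:
            if attempt == max_attempts - 1:
                raise  # out of retries; before the fix, the EDU was lost here
            print(f"request connection failed; retrying in {delay}s: {e!r}")
            await asyncio.sleep(delay)
            delay *= 2
```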

Looking around the Failed to handle edu line, on_edu does not appear to persist anything. Checking its callers, they do no persistence either; this seems to be the case the entire way up the call stack.

Conclusion: Synapse will drop received to-device messages if it cannot talk to a worker process in time. This is disastrous as it means we have dropped room keys.
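
To make the failure mode concrete, here is a minimal sketch of the pattern the logs suggest; the function and argument names are illustrative, not Synapse's actual code:

```python
import logging

logger = logging.getLogger(__name__)


async def handle_transaction_edus(edus, forward_edu_to_worker):
    """Sketch of the pre-fix behaviour: a failure to hand an EDU to its
    worker is logged and then ignored, so the federation /send request
    still completes with 200 OK and the sender never retries."""
    for edu in edus:
        try:
            # In production this is a replication HTTP call to the
            # encryption worker; here it is a placeholder coroutine.
            await forward_edu_to_worker(edu)
        except Exception as e:
            # The to-device message is dropped here: nothing has been
            # persisted and the error never reaches the HTTP layer.
            logger.info("Failed to handle edu %r: %r", edu.get("edu_type"), e)
    # 200 OK is returned regardless of whether any EDU succeeded.
    return 200, {}
```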

Steps to reproduce

Presumably:

  • use Synapse in worker mode.
  • artificially block traffic between the federation-inbound and encryption workers.
  • send a to-device message to the server over federation.
  • watch as it times out trying to talk to the encryption worker, finally sending a 200 OK to /send.
  • the to-device message won't be retried by the sender (why would it, since it got a 200 OK), and Synapse has no record of it anywhere to retry delivering later.

Homeserver

matrix.org and element.io

Synapse Version

1.105

Installation Method

I don't know

Database

PostgreSQL

Workers

Multiple workers

Platform

?

Configuration

No response

Relevant log output

Already attached.

Anything else that would be useful to know?

This is potentially a major cause of UTDs, particularly during meltdowns.

@erikjohnston (Member) commented:

Oof. I think the correct thing to do here is to either a) insert to_device messages into a table to be retried, or b) fail the federation request so it retries.
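
As a rough illustration of option (a), the key property is that the EDU hits durable storage before the transaction is acknowledged; the table name, schema and database helper below are hypothetical, not Synapse's:

```python
import json


async def stash_incoming_to_device_edu(db, origin: str, edu: dict) -> None:
    """Hypothetical sketch of option (a): durably store the received EDU
    before the federation transaction is acknowledged, so a background
    job can retry delivery to the worker once it is reachable again."""
    await db.execute(
        "INSERT INTO pending_federation_to_device (origin, edu_json)"
        " VALUES (?, ?)",
        (origin, json.dumps(edu)),
    )
    # Only after this row is committed is it safe to return 200 OK to the
    # sending server; forwarding to the encryption worker can happen later.
```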

@ara4n (Member) commented Apr 24, 2024

/me wonders what other what-goes-wrong idempotency failures exist on the to-device msg path

@erikjohnston (Member) commented:

> /me wonders what other what-goes-wrong idempotency failures exist on the to-device msg path

We've tried to audit most of the code paths, and we have followed up on every report we've received without finding anything. I encourage people to point us at to-device messages that they think have gone missing server side.

richvdh self-assigned this Apr 25, 2024
richvdh added a commit that referenced this issue Apr 29, 2024
... when workers are unreachable, etc.

Fixes #17117.

The general principle is just to make sure that we propagate any
exceptions to the JsonResource, so that we return an error code to the
sending server. That means that the sending server no longer considers
the message safely sent, so it will retry later.

In the issue, Erik mentions that an alternative solution would be to
persist the to-device messages into a table so that they can be retried.
This might be an improvement for performance, but even if we did that,
we still need this mechanism, since we might be unable to reach the
database. So, if we want to do that, it can be a later follow-up.

---------

Co-authored-by: Erik Johnston <erik@matrix.org>
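
A minimal sketch of that principle, reusing the illustrative names from the sketch earlier in this issue (this is not the actual patch): the handler failure is allowed to propagate, so the /send request fails and the origin keeps the transaction queued for retry.

```python
async def handle_transaction_edus(edus, forward_edu_to_worker):
    """Sketch of the post-fix principle: no try/except around the worker
    call, so a ConnectionRefusedError (or SynapseError) propagates up to
    the JsonResource, the /send request returns an error code, and the
    sending server retries the transaction later."""
    for edu in edus:
        await forward_edu_to_worker(edu)
    return 200, {}
```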