Context: Matrix.org had a bit of a meltdown a few days ago. This led to a series of unable-to-decrypts (UTDs) appearing for various users. As part of element-hq/element-meta#245 I have been looking into UTDs, so I looked at some rageshakes submitted by @ara4n .
In his case, he sent a to-device message from element.io to matrix.org users, and it appears to have been delivered to matrix.org. However, the logs on matrix.org tell a different story:
2024-04-22 08:55:59,186 - synapse.federation.transport.server.federation - 112 - INFO - PUT-878465e2ae4a2c6b-FRA- - Received txn 1713706060091 from element.io. (PDUs: 0, EDUs: 1)
2024-04-22 08:55:59,491 - synapse.http.client - 945 - INFO - PUT-878465e2ae4a2c6b-FRA---- - Error sending request to POST synapse-replication://encryption-1/_synapse/replication/fed_send_edu/m.direct_to_device/UVKnlAjoju: ConnectionRefusedError Connection refused
2024-04-22 08:55:59,491 - synapse.replication.http._base - 318 - WARNING - PUT-878465e2ae4a2c6b-FRA--- - fed_send_edu request connection failed; retrying in 1s: ConnectionRefusedError('Connection refused')
2024-04-22 08:56:00,794 - synapse.http.client - 945 - INFO - PUT-878465e2ae4a2c6b-FRA---- - Error sending request to POST synapse-replication://encryption-1/_synapse/replication/fed_send_edu/m.direct_to_device/UVKnlAjoju: ConnectionRefusedError Connection refused
2024-04-22 08:56:00,795 - synapse.replication.http._base - 318 - WARNING - PUT-878465e2ae4a2c6b-FRA--- - fed_send_edu request connection failed; retrying in 2s: ConnectionRefusedError('Connection refused')
2024-04-22 08:56:03,101 - synapse.http.client - 945 - INFO - PUT-878465e2ae4a2c6b-FRA---- - Error sending request to POST synapse-replication://encryption-1/_synapse/replication/fed_send_edu/m.direct_to_device/UVKnlAjoju: ConnectionRefusedError Connection refused
2024-04-22 08:56:03,101 - synapse.replication.http._base - 318 - WARNING - PUT-878465e2ae4a2c6b-FRA--- - fed_send_edu request connection failed; retrying in 4s: ConnectionRefusedError('Connection refused')
2024-04-22 08:56:07,405 - synapse.http.client - 945 - INFO - PUT-878465e2ae4a2c6b-FRA---- - Error sending request to POST synapse-replication://encryption-1/_synapse/replication/fed_send_edu/m.direct_to_device/UVKnlAjoju: ConnectionRefusedError Connection refused
2024-04-22 08:56:07,405 - synapse.replication.http._base - 318 - WARNING - PUT-878465e2ae4a2c6b-FRA--- - fed_send_edu request connection failed; retrying in 8s: ConnectionRefusedError('Connection refused')
2024-04-22 08:56:15,709 - synapse.http.client - 945 - INFO - PUT-878465e2ae4a2c6b-FRA---- - Error sending request to POST synapse-replication://encryption-1/_synapse/replication/fed_send_edu/m.direct_to_device/UVKnlAjoju: ConnectionRefusedError Connection refused
2024-04-22 08:56:15,709 - synapse.replication.http._base - 318 - WARNING - PUT-878465e2ae4a2c6b-FRA--- - fed_send_edu request connection failed; retrying in 16s: ConnectionRefusedError('Connection refused')
2024-04-22 08:56:32,015 - synapse.http.client - 945 - INFO - PUT-878465e2ae4a2c6b-FRA---- - Error sending request to POST synapse-replication://encryption-1/_synapse/replication/fed_send_edu/m.direct_to_device/UVKnlAjoju: ConnectionRefusedError Connection refused
2024-04-22 08:56:32,015 - synapse.replication.http._base - 318 - WARNING - PUT-878465e2ae4a2c6b-FRA--- - fed_send_edu request connection failed; retrying in 32s: ConnectionRefusedError('Connection refused')
2024-04-22 08:57:04,396 - synapse.http.client - 945 - INFO - PUT-878465e2ae4a2c6b-FRA---- - Error sending request to POST synapse-replication://encryption-1/_synapse/replication/fed_send_edu/m.direct_to_device/UVKnlAjoju: ConnectionRefusedError Connection refused
2024-04-22 08:57:04,396 - synapse.federation.federation_server - 1439 - INFO - PUT-878465e2ae4a2c6b-FRA-- - Failed to handle edu 'm.direct_to_device': SynapseError('502: Failed to talk to encryption-1 process')
2024-04-22 08:57:04,422 - synapse.access.http.15108 - 473 - INFO - PUT-878465e2ae4a2c6b-FRA - 3.74.23.23 - 15108 - {element.io} Processed request: 65.231sec/0.008sec (0.008sec, 0.003sec) (0.008sec/0.003sec/2) 11B 200 "PUT /_matrix/federation/v1/send/1713706060091 HTTP/1.1" "Synapse/1.105.0" [0 dbevts]
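The 65-second processing time lines up with the replication client's exponential backoff visible above (retries after 1 s, 2 s, 4 s, 8 s, 16 s and 32 s before giving up):

```python
# The retry delays logged above follow a doubling schedule.
delays = [2 ** n for n in range(6)]
total = sum(delays)
print(delays, total)  # [1, 2, 4, 8, 16, 32] 63
```

63 seconds of waiting, plus the time spent on the connection attempts themselves, matches the 65.231 sec request duration in the final log line.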
To ensure that to-device messages get delivered reliably, matrix.org must persist the EDU to disk before returning the 200 OK to element.io. Looking at these logs, the 200 OK is sent at 08:57:04,422, over a minute after receiving the request.
Conclusion: Synapse will drop received to-device messages if it cannot talk to a worker process in time. This is disastrous as it means we have dropped room keys.
Steps to reproduce
Presumably:
- use Synapse in worker mode
- artificially block traffic between the federation-inbound and encryption workers
- send a to-device message to the server over federation
- watch as it times out trying to talk to the encryption worker, finally returning a 200 OK for /send
- the to-device message won't be retried by the sender (why would it, when it got a 200 OK?), and Synapse has no record of it anywhere to retry delivery later
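The failure mode in the steps above can be sketched as a toy model (hypothetical names, not Synapse's actual code): the receiving side swallows the worker error but still acknowledges the transaction, so the sender discards the message.

```python
# Toy model of the bug (hypothetical names, not Synapse's real code):
# the receiver swallows the worker error and still ACKs the transaction.

class WorkerUnreachable(Exception):
    pass

def forward_to_encryption_worker(edu):
    # Stands in for the fed_send_edu replication call failing.
    raise WorkerUnreachable("Connection refused")

def handle_send_transaction(edu):
    try:
        forward_to_encryption_worker(edu)
    except WorkerUnreachable:
        # The EDU is neither persisted nor re-queued here...
        pass
    return 200  # ...yet we still ACK, so the sender drops it from its outbox.

sender_outbox = ["m.room_key to-device message"]
status = handle_send_transaction(sender_outbox[0])
if status == 200:
    sender_outbox.clear()  # sender considers it safely delivered

print(sender_outbox)  # [] -- the room key is gone on both sides
```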
Homeserver
matrix.org and element.io
Synapse Version
1.105
Installation Method
I don't know
Database
PostgreSQL
Workers
Multiple workers
Platform
?
Configuration
No response
Relevant log output
Already attached.
Anything else that would be useful to know?
This is potentially a major cause of UTDs, particularly during meltdowns.
Oof. I think the correct thing to do here is to either a) insert to_device messages into a table to be retried, or b) fail the federation request so it retries.
/me wonders what other idempotency failures of this kind exist on the to-device message path
We've tried to audit most of the code paths, and have followed up on every report we've received without finding anything. I encourage people to point us at to-device messages that they think have gone missing server-side.
... when workers are unreachable, etc.
Fixes #17117.
The general principle is just to make sure that we propagate any
exceptions to the JsonResource, so that we return an error code to the
sending server. That means that the sending server no longer considers
the message safely sent, so it will retry later.
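The fix can be illustrated with a small sketch (hypothetical names; `serve` stands in for Synapse's JsonResource, and this is not the actual patch): letting the exception propagate turns the response into an error, so the sending server keeps the transaction queued for retry.

```python
# Toy sketch of the fix (hypothetical names): propagate the failure so the
# HTTP layer returns an error and the sending server retries later.

class WorkerUnreachable(Exception):
    pass

def forward_to_encryption_worker(edu):
    raise WorkerUnreachable("Connection refused")

def handle_send_transaction(edu):
    # No try/except: let the exception reach the HTTP resource.
    forward_to_encryption_worker(edu)
    return 200

def serve(edu):
    # Stand-in for JsonResource: map uncaught exceptions to a 502.
    try:
        return handle_send_transaction(edu)
    except WorkerUnreachable:
        return 502

sender_outbox = ["m.room_key to-device message"]
status = serve(sender_outbox[0])
if status == 200:
    sender_outbox.clear()

print(status, sender_outbox)  # 502 ['m.room_key to-device message'] -- kept for retry
```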
In the issue, Erik mentions that an alternative solution would be to
persist the to-device messages into a table so that they can be retried.
This might be an improvement for performance, but even if we did that,
we still need this mechanism, since we might be unable to reach the
database. So, if we want to do that, it can be a later follow-up.
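A minimal sketch of that alternative, assuming a hypothetical `pending_to_device_edus` table (illustrated with sqlite3, not Synapse's actual storage layer): persist the EDU before acknowledging, and deliver it from the table later.

```python
# Sketch of the persist-then-retry alternative (hypothetical schema,
# not Synapse's real storage layer).
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE pending_to_device_edus ("
    " id INTEGER PRIMARY KEY, edu_json TEXT NOT NULL)"
)

def handle_send_transaction(edu):
    # Persist first: once this commits, returning 200 OK is safe even if
    # the encryption worker is unreachable right now.
    db.execute(
        "INSERT INTO pending_to_device_edus (edu_json) VALUES (?)",
        (json.dumps(edu),),
    )
    db.commit()
    return 200

def retry_pending():
    # A background loop would hand these rows to the worker, deleting
    # each row only after successful delivery.
    rows = db.execute("SELECT edu_json FROM pending_to_device_edus").fetchall()
    return [json.loads(row[0]) for row in rows]

handle_send_transaction({"type": "m.direct_to_device", "content": {}})
print(retry_pending())  # [{'type': 'm.direct_to_device', 'content': {}}]
```

As the paragraph above notes, even with such a table the insert itself can fail, so propagating exceptions remains necessary either way.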
---------
Co-authored-by: Erik Johnston <erik@matrix.org>
Looking around the "Failed to handle edu" line in the logs, on_edu does not appear to persist anything. Checking what calls it, it also does no persistence. This seems to be the case the entire way up the call stack.