to_device messages are going missing on matrix.org, causing UISIs 🔥 #6450
pan's logs at the time of the missing to_device are:
In other words, it looks like /sync doesn't even return for the missing to_device msg.
Looking at the synchrotron logs (sync3) is a bit more revealing - it looks like pan is maintaining two separate overlapping /sync streams - one with
Here are the logs for each /sync request, grouped together (and thus separating all the overlap).
The contract for to_device messages is that once the next /sync in a given stream has been received, the server considers the to_device messages delivered and can delete them server-side (and will not deliver them again). So if your client has two parallel sync streams, and the to_device message comes down on the wrong stream and gets discarded, it's game over. I assume this is what's happening here. Alternatively, we have enough logging here to figure out where in the to_device stream the syncs kept asking from, and where the synchrotrons calculated up to. Specifically:
(ordered by the receive timestamp, but note that the processed timestamps have quite a different ordering). Correlating this with the actual missing to_device message from the

TL;DR: I think the to_device msg went down the wrong sync stream and got discarded by pan. Paging @poljar...
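To make the delivery contract described above concrete, here is a minimal toy model of how a message can vanish when two sync streams share one device. It is only a sketch: `ToyServer`, its fields, and `demo` are hypothetical names invented for illustration, and this is not how Synapse or the synchrotrons are actually implemented.

```python
from dataclasses import dataclass, field


@dataclass
class ToyServer:
    """Toy model of one device's to_device inbox (hypothetical, not Synapse code)."""
    inbox: dict = field(default_factory=dict)  # stream id -> message
    next_id: int = 1

    def send(self, message):
        # Queue a to_device message at the next stream position.
        self.inbox[self.next_id] = message
        self.next_id += 1

    def sync(self, since):
        # A request with `since` acknowledges everything at or below that
        # position, so the server is free to delete those messages.
        for stream_id in [i for i in self.inbox if i <= since]:
            del self.inbox[stream_id]
        pending = [self.inbox[i] for i in sorted(self.inbox)]
        next_batch = self.next_id - 1  # current end of the stream
        return pending, next_batch


def demo():
    server = ToyServer()
    pan_since = client_since = 0  # two streams, same device, separate tokens
    server.send("room key for pan")
    # The client's stream happens to sync first and receives the message,
    # which it ignores.
    client_got, client_since = server.sync(client_since)
    # The client's next sync acknowledges it, so the server deletes it.
    _, client_since = server.sync(client_since)
    # pan's stream now syncs from its old token and finds nothing left.
    pan_got, pan_since = server.sync(pan_since)
    return client_got, pan_got
```

In this model `demo()` hands the message to the client stream only; by the time pan's stream asks, the message has already been acknowledged and deleted.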
So this turns out to be a bug in pantalaimon - it's reusing the same access_token for the /sync streams of both its client and pan itself, which means that sometimes to_device messages get erroneously routed to the client, which ignores them, rather than to pan. The fix is for pan to process to_device messages on both streams. @poljar is on the case.
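As a rough illustration of "process to_device messages on both streams", the sketch below feeds every /sync response, whichever stream it arrived on, into one shared handler. `SharedToDeviceHandler` and `feed` are hypothetical names and this is not pantalaimon's actual code; the only thing taken from the protocol is that a /sync response carries its to_device events under `to_device.events`. De-duplicating by serialized event content is an assumption of this sketch, since the same message could in principle arrive on both streams before either acknowledges it.

```python
import json


class SharedToDeviceHandler:
    """Hypothetical handler shared by pan's own stream and the client's stream."""

    def __init__(self):
        self.seen = set()   # serialized events already handled
        self.handled = []   # events handed on for decryption

    def feed(self, sync_response):
        """Process to_device events from any /sync response; return how many were new."""
        new_events = 0
        events = sync_response.get("to_device", {}).get("events", [])
        for event in events:
            key = json.dumps(event, sort_keys=True)
            if key in self.seen:
                continue  # already handled via the other stream
            self.seen.add(key)
            self.handled.append(event)
            new_events += 1
        return new_events


handler = SharedToDeviceHandler()
event = {"type": "m.room.encrypted", "sender": "@alice:example.org", "content": {}}
own_stream = {"next_batch": "s1", "to_device": {"events": []}}
client_stream = {"next_batch": "s1", "to_device": {"events": [event]}}

handler.feed(own_stream)     # pan's own stream saw nothing
handler.feed(client_stream)  # the event arrived on the client's stream instead
```

With this shape it no longer matters which stream the server picks for a given message, as long as every response passes through `feed` before being relayed to the client.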
Moved over to matrix-org/pantalaimon#30
See #6433.
tl;dr: Just had pantalaimon in #moderation suddenly get UISIs from my riot-web. I can see my riot sending a to_device message with the to_devices, but the to_device never gets received by pantalaimon. Discarding the outbound session in riot-web via /discardsession resolves the problem.
Here's my riot-web trying to start a new outbound session and relay it to pan (from https://github.com/matrix-org/riot-web-rageshakes/issues/1989):
We try to share it with pan:
Meanwhile, Synapse sees this to_device put:
...which looks fine; the same as all the other shares in that slice. However, pantalaimon simply doesn't see it. If it had, it would have logged something like this (as seen after invoking /discardsession):
But instead there is not even 'Decrypting event of type OlmEvent', or any errors from a wedged olm session. The to_device message has simply gone missing.
This could also explain https://github.com/matrix-org/riot-web-rageshakes/issues/1943.