to_device message didn't get delivered #9533
Comments
Hi @clokep, I'm not sure how the severity levels work for Synapse, but I just want to clarify that this breaks encryption when it happens.
Yeah, this really isn't minor severity, given it entirely breaks encryption with one or more devices when it happens, with no way to recover.
Moving to S-Critical given it causes data loss.
@bwindels Are you able to reproduce this? Unfortunately the promotion to S-Critical occurred after the server-side logs had expired, so we'll instead take this issue as an opportunity to double-check that our logging will be sufficient to resolve this the next time it occurs. We're looking into how we can ensure that we catch new information which should modify the severity.
Here are some logs I took today, in which @alice:mls sends two to-device messages to @bob:mls. The first one (2021-05-14 21:48:35,516) seemed to be fairly quick, though it was maybe a second or two. The second one (2021-05-14 21:49:40,775) took quite a while -- it looks like bob's sync doesn't return until (2021-05-14 21:50:10,398) -- about 30 seconds later.
Closing pending further reports; we're not aware of this occurring since Bruno's initial report. But we'll leave the debug logging cranked up on matrix.org so we can take action if we do see it pop up again.
I think I am hitting this issue now too. For some reason my Nheko on my laptop can only sometimes send to_device messages to my Element instance on the same laptop. Relevant logs (with logging set to DEBUG):
So Synapse is accepting and processing the request, but I can't find it in my Element network tab at all (and Element never logs that it is decrypting a to_device message). Weirdly enough, sometimes it does work, but in about 80% of cases it doesn't. To me all the logs look correct, but the event just never arrives... I would assume Nheko is doing something weird here, but it works perfectly fine with a different, less busy Element client (not 20 presence events per second). Just not with this one on the same account...
Okay, I have one clue now. Disabling presence makes to_device messages a lot more reliable. So there seems to be some race where a new to_device event gets inserted at the same time as a presence event, and then maybe only the presence event gets pulled from the stream? Note that I am running 2 synchrotrons next to my master, an event persister and a few federation workers.
Another data point: it seems not to happen with just 1 /sync worker instead of 2.
Thank you so much for bringing this back up, those do sound like promising leads. We're pretty slammed until the 1.40 RC goes out (next week), but we'll take up investigating after that.
Thank you for planning to look into this! I tried myself for a day, but I got a bit lost and couldn't figure out where the messages actually get lost, since the logs look correct even with some extra logging added on my side. At least I know how to work around it in the meantime when I hit it.
If in doubt, print it out. Thanks Erik for advising.
@deepbluev7 I've added a bunch of logging in ceb29d4, and attached a patch. Would you be able to apply the patch and trigger the issue, then provide logs? It's all guarded on the same config that Rich introduced before, so just need your homeserver and worker configs to have
Logs for a key request from TDXVQTYIQM to PCSZKSTRBC and the immediate reply in the other direction. TDXVQTYIQM is Element Web, PCSZKSTRBC is Nheko. It could of course also be a bug in either of those, but Nheko seems to send it, so maybe Element could be too slow to process the immediate reply. I already tried to get around that by adding a sleep, though, so it is probably Synapse. master:
synchrotron1
Didn't route any traffic to synchrotron2
event creator
@deepbluev7 many thanks for those logs. I think you're saying that PCSZKSTRBC responded to the key request, but that TDXVQTYIQM didn't seem to get that response? Assuming that to be the case, I can't see anything obviously wrong on Synapse's end (though my eyeballs are still new). In particular, logging claims we sent down a to_device message in our response to GET-787:
I'd like to understand where the 127.0.0.1 and aiohttp user agent comes from (reverse proxy?) given that this is supposed to be element-web making the request. Are there any corresponding client logs that can corroborate this version of events? E.g.
I think this is the response from PCSZKSTRBC intended directly for TDXVQTYIQM. The stream_id bumps to 197879.
Later on it looks like GET-787 handles a sync for TDXVQTYIQM. Logging claims that
This sync request is from 127.0.0.1 "Python/3.8 aiohttp/3.7.4" which surprised me. I was expecting to see a user agent indicative of Element web from the original key request, namely
But perhaps the [...]. Anyway, when do we delete the to_device message? Seems to be GET-790, which is presumably the next /sync from TDXVQTYIQM to use the stream token from GET-787.
So I don't think GET-787 is Element. I have the following UA in my logs:
Okay, confirmed. Stopping the Telegram bridge makes to_device messaging work reliably.
Okay, this is embarrassing, but I figured out why the to_device messages go missing, and it might be a misconfiguration on my side or Synapse relying on behaviour it can't assume to be true. Basically the story is as follows: the Telegram bridge supports double puppeting, and for that it needs a Matrix access token. I took this token from the Element login I was testing with (about 1.5 years ago, and I completely forgot about it; it still uses MDAx and I just renamed it at some point). Because of that there are 2 /sync streams for this device, one from Element and one from Telegram, and so the to_device message gets delivered to my Telegram bridge and Element never receives it. Now I'm not sure what should happen, since to_device messages ideally wouldn't get lost with multiple sync streams, but that sounds very hard to implement correctly... So I guess the Telegram bridge docs should be updated to recommend never taking the token from a login that's in use, and this may not actually be a bug. I'm sorry for wasting everyone's time, but I really couldn't figure that out. /me digs a hole to vanish into....
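To make the failure mode concrete, here is a rough, hypothetical Python model of the delivery semantics described above (it is not Synapse's actual code): a to-device message sits in a per-device queue until some /sync for that device presents a since-token past it, at which point it is deleted. That matches the GET-787/GET-790 observation earlier in the thread, and it also shows why a second consumer sharing the same device and access token (here, the bridge) can swallow a message before the intended client ever sees it.

```python
# Hypothetical model of per-device to-device delivery; NOT Synapse's real code.
# It illustrates why two /sync loops sharing one device/token lose messages:
# whichever loop advances its since-token first implicitly acks (deletes) them.

class ToDeviceQueue:
    def __init__(self):
        self.next_pos = 1
        self.pending = {}  # stream position -> message content

    def send(self, message):
        self.pending[self.next_pos] = message
        self.next_pos += 1

    def sync(self, since):
        # Anything at or below the presented since-token was acknowledged by a
        # previous sync, so it is deleted here (cf. GET-790 above).
        for pos in [p for p in self.pending if p <= since]:
            del self.pending[pos]
        delivered = [self.pending[p] for p in sorted(self.pending)]
        next_batch = self.next_pos - 1
        return next_batch, delivered


queue = ToDeviceQueue()
queue.send("m.room_key_request reply from PCSZKSTRBC")

# The bridge and Element each run their own sync loop against the same device.
bridge_batch, bridge_msgs = queue.sync(since=0)  # the bridge receives the message
queue.sync(since=bridge_batch)                   # the bridge's next sync deletes it
_, element_msgs = queue.sync(since=0)            # Element gets nothing

print(bridge_msgs)   # ['m.room_key_request reply from PCSZKSTRBC']
print(element_msgs)  # []
```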
Closing this issue as the root cause of the most recent report has been identified and resolved. Thank you so much, everyone.
Just to set it down for the record: we've investigated another apparent instance of a lost to_device message.
This turns out to be the root cause of the main issue that's stopping us from launching native Matrix group calling, which uses to-device messages for VoIP call signalling as per MSC3401 - so if to-device messages are being dropped, call setup fails and you end up in a split-brained video conference. Reproducibility is very inconsistent, and slowing down the server by adding in debugging or Jaeger seems to make it harder to reproduce. I caught it in the act however (before dialling up the log level), and have two HARs (sender and receiver) showing 6 to-device messages being sent in a row from the sender, but only the last 3 being received in /sync. This is between two local users on robertlong.dev, which is a completely vanilla non-workerised Synapse (using SQLite, as it happens). I've tried to reproduce it with just the relevant loglines dialled up, but after 2 hours I haven't managed to. I'll leave the logging on and have another shot at reproducing it tomorrow. Prior to dialling up the logging I saw it about 3 times this evening. The bug is definitely still there; am reopening. Given it's blocking the VoIP launch, I really want to get to the bottom of it.
So I went on a crusade on this one and wrote a crappy torture-tester to exercise to-device with MSC3401-style traffic patterns... and I couldn't reproduce it (although client<->server connection failures when stress-testing threw up a lot of confusion). I tried remotely against robertlong.dev, and locally (hammering away at 500 req/s or so); I tried mixing in presence changes and sending state & timeline events; I tried jittering and dejittering; I tried firing off requests in parallel and in series... but I never conclusively found a to-device event dropped by the server. (I think I found one at one point, but I can't reproduce it and it seems more likely to be a thinko.) In the end, I went back to the HARs I captured that illustrated the missing events... and realised that the missing 3 events were sent before the receiving client was launched - and given the launch was a browser page refresh, they must have been received instead by the previous incarnation of the page. So the actual problem with matrix-video-chat and MSC3401 is that it needs to detect and recover from that failure mode (given to-device msgs have no persistence, unlike in-room m.call.* signalling). So, I'm going to cautiously say (again) that this looks to be working okay. The torture testjig is available as very quick and dirty Node at https://github.com/ara4n/todevice-collider for the next person (if any) who finds themselves wanting to test their to-device messaging.
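For anyone who prefers Python to Node, here is a much-simplified, hypothetical sketch of the same send-then-count idea against the standard r0 client-server API. The homeserver URL, access tokens, user ID and device ID are placeholders, and the custom event type and `marker` field are made up purely so the receiver can tell the test messages apart.

```python
# Minimal to-device send/receive check; all credentials below are placeholders.
import uuid

import requests

HOMESERVER = "https://example.org"
SENDER_TOKEN = "sender_access_token"
RECEIVER_TOKEN = "receiver_access_token"
RECEIVER_USER = "@bob:example.org"
RECEIVER_DEVICE = "RECEIVERDEVICE"


def send_to_device(marker):
    # PUT /_matrix/client/r0/sendToDevice/{eventType}/{txnId}
    txn_id = str(uuid.uuid4())
    body = {"messages": {RECEIVER_USER: {RECEIVER_DEVICE: {"marker": marker}}}}
    r = requests.put(
        f"{HOMESERVER}/_matrix/client/r0/sendToDevice/com.example.test/{txn_id}",
        headers={"Authorization": f"Bearer {SENDER_TOKEN}"},
        json=body,
    )
    r.raise_for_status()


def drain_sync(since):
    # GET /_matrix/client/r0/sync returns to-device events under to_device.events
    params = {"timeout": "1000"}
    if since is not None:
        params["since"] = since
    r = requests.get(
        f"{HOMESERVER}/_matrix/client/r0/sync",
        headers={"Authorization": f"Bearer {RECEIVER_TOKEN}"},
        params=params,
    )
    r.raise_for_status()
    data = r.json()
    events = data.get("to_device", {}).get("events", [])
    return data["next_batch"], events


# Fire a burst of markers, then keep syncing until they all arrive (or we give up).
markers = {f"marker-{i}" for i in range(6)}
for m in sorted(markers):
    send_to_device(m)

since, seen = None, set()
for _ in range(20):
    since, events = drain_sync(since)
    seen |= {e.get("content", {}).get("marker") for e in events}
    if markers <= seen:
        break

print("missing:", markers - seen)
```

In normal operation the final print shows an empty set; markers going missing while only one client is syncing for that device would look like the bug described in this issue.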
Description
With two clients connected for user @bruno1e2ee:matrix.org, device ACNVYZCKHP is sending to_device messages and device HQAFTINYFN is syncing. One to_device message didn't get delivered. The request & response for the to_device message are in send.txt, and the HAR for syncing with HQAFTINYFN is attached as sync.zip. The to_device message is expected in the response to https://matrix.org/_matrix/client/r0/sync?since=s1850875933_757284961_12315629_764933098_636406164_2112340_221003142_783098628_175975&timeout=0&filter=1&_cacheBuster=5607312750984851 (the timestamps in the response headers can be correlated) but never comes.
Attachments: send.txt, sync.zip
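As an aside for anyone working through captures like the attached sync.zip: a small, hypothetical Python helper along these lines lists which /sync responses actually carried to_device events. The file name, the assumption that the zip contains a single .har file, and the plain-text response bodies are guesses about the capture rather than anything stated in the report.

```python
# Hypothetical helper: scan a zipped HAR capture for /sync responses that
# contained to-device events. Assumes one .har file with plain-text bodies.
import json
import zipfile

with zipfile.ZipFile("sync.zip") as zf:
    har_name = next(n for n in zf.namelist() if n.endswith(".har"))
    har = json.loads(zf.read(har_name))

for entry in har["log"]["entries"]:
    url = entry["request"]["url"]
    if "/_matrix/client/r0/sync" not in url:
        continue
    text = entry["response"]["content"].get("text")
    if not text:
        continue  # body missing or not exported as text
    body = json.loads(text)
    events = body.get("to_device", {}).get("events", [])
    if events:
        print(entry["startedDateTime"], [e.get("type") for e in events])
```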